\usepackage{titling}
\usepackage{color}
\definecolor{RB_blue}{RGB}{0, 108, 149}
\pretitle{%
  \begin{center}
  \color{RB_blue}
  {\huge\bfseries A Guide to Bayesian Modeling}
  \vspace{6.0cm}
  \setlength{\arrayrulewidth}{0.5mm}
  \setlength{\tabcolsep}{18pt}
  \renewcommand{\arraystretch}{1.5}
  \color{black}
  \includegraphics[]{logo.jpg}\\[\bigskipamount]
  \vspace{6.0cm}
  \begin{flushright}
    \begin{tabular}{c}
      \small Author: Belias Michael\\
      \small PhD in Biostatistics\\
      \small Athens, 20 April 2017\\
    \end{tabular}
  \end{flushright}
  \newpage
}
\posttitle{\end{center}}
\usepackage{fancyhdr}
\pagestyle{fancy}
\rhead{\includegraphics[width=.07\textwidth]{small_logo.jpg}}
\fancyfoot[CE,CO]{\leftmark}
\fancyfoot[LE,RO]{\thepage}
\renewcommand{\headrulewidth}{2pt}
\renewcommand{\footrulewidth}{1pt}
\setlength{\paperheight}{875pt}
According to the Oxford dictionary, statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. Data may come from any applied field, such as medicine, finance, the social sciences, or physics, and can be separated into two types: quantitative and qualitative.
A statistical analysis typically proceeds in seven steps:
Define the problem
Data collection and manipulation
Explore the data
Based on the first three steps, decide which model to use
Fit the model
Check the model and refine it if necessary
Finalize the model and draw inferences
The above steps are not strictly distinct; in some cases they overlap, and some contain nested sub-steps. The same principles apply in the Bayesian framework as well.
In this tutorial we will review common probability distributions, Bayes’ theorem, Monte Carlo estimation, Markov chains, and the Metropolis-Hastings and Gibbs sampling algorithms.
The discrete uniform distribution is the simplest discrete probability distribution. It assigns equal probability to N different outcomes, usually represented by the numbers 1, 2, …, N.
\(X \sim \text{Uniform}(N)\)
\(P(X = x|N) = 1/N\) for x = 1,2,…N
\(E[X]= \frac{N+1}{2}\)
\(Var[X]= \frac{N^2-1}{12}\)
One common example is the outcome of throwing a fair six-sided die where N=6.
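To make this concrete, here is a small R sketch of the die example; the simulation size m and the seed are arbitrary choices, not part of the text:

```r
set.seed(42)                                # arbitrary seed for reproducibility
m <- 100000                                 # assumed simulation size
x <- sample(1:6, size = m, replace = TRUE)  # m rolls of a fair die, Uniform(6)
mean(x)  # should be close to E[X] = (6 + 1)/2 = 3.5
var(x)   # should be close to (6^2 - 1)/12, about 2.92
```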
The Bernoulli distribution is used for binary outcomes, such as 0 and 1. It has one parameter p, the probability of “success,” i.e., the probability of observing a 1 (or whichever value we designate as success). \(X \sim \text{Bern}(p)\)
\(P(X = x | p) = p^x(1-p)^{1-x}\) for x = 0, 1
E[X] = p
Var[X] = p(1-p)
One common example is the outcome of flipping a fair coin (p = 0.5)
The binomial distribution counts the number of “successes” in n independent Bernoulli trials (each with the same probability of success). Thus if \(X_1 , X_2 ,..., X_n\) are independent Bernoulli(p) random variables, then Y = \(\sum^n_{i=1} X_i\) is binomial distributed.
\(Y \sim \text{Binom}(n, p)\)
P(Y= y|n,p) = \(\binom{n}{y} {p^y}(1-p)^{(n-y)}\) , for y = 0,1, …, n
E[Y] =np
Var[Y]= np(1-p)
where \(\binom{n}{y}= \frac{n!}{y!(n-y)!}\) .
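As a quick check of the formulas above, one can build the binomial pmf from \(\binom{n}{y}\) directly and compare it with R’s built-in dbinom(); n and p here are arbitrary example values:

```r
n <- 10; p <- 0.3                              # assumed example values
y <- 0:n
pmf <- choose(n, y) * p^y * (1 - p)^(n - y)    # pmf from the formula
all.equal(pmf, dbinom(y, size = n, prob = p))  # matches R's built-in pmf
sum(y * pmf)                                   # E[Y] = n * p = 3
```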
The Poisson distribution is used for counts, and arises in a variety of situations. The parameter \(\lambda\) > 0 is the rate at which we expect to observe the thing we are counting.
\(X \sim \text{Pois}(\lambda)\)
\(P(X = x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}\), for x = 0, 1, 2, …
E[X] = \(\lambda\)
Var[X] = \(\lambda\)
A Poisson process is a process wherein events occur on average at rate \(\lambda\), events occur one at a time, and events occur independently of each other.
Example:
Significant earthquakes occur in the Western United States approximately following a Poisson process with a rate of two earthquakes per week. What is the probability there will be at least 3 earthquakes in the next two weeks? Answer: the rate per two weeks is 2*2 = 4, so let X ~ Pois(4); we want \(P(X \geq 3) = 1 - P(X\leq 2) = 1 - P(X = 0) - P(X = 1) - P(X = 2) = 1 - e^{-4} - 4e^{-4} - \frac{4^2 e^{-4}}{2} = 1 - 13e^{-4} \approx 0.762\)
Note that 0! = 1 by definition.
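The earthquake calculation can be verified in R, either with the cumulative distribution function or by summing the pmf directly:

```r
1 - ppois(2, lambda = 4)         # P(X >= 3) = 1 - P(X <= 2)
1 - sum(dpois(0:2, lambda = 4))  # same answer, summing the pmf terms
1 - 13 * exp(-4)                 # the closed form from the text, about 0.762
```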
The geometric distribution is the number of failures before obtaining the first success, i.e., the number of Bernoulli failures until a success is observed, such as the first head when flipping a coin. It takes values on the non-negative integers starting with 0 (alternatively, we could count the total number of trials until the first success, in which case the count would start at 1).
\(X \sim \text{Geo}(p)\)
\(P(X = x \mid p) = p(1-p)^x\), for x = 0, 1, 2, …
\(E[X] = \frac{1-p}{p}\)
If the probability of getting a success is p, then the expected number of trials until the first success is 1/p, and the expected number of failures before the first success is (1-p)/p.
The negative binomial distribution extends the geometric distribution to model the number of failures before achieving the r-th success. It takes values on the non-negative integers starting with 0.
\(Y \sim \text{NegBinom}(r, p)\)
\(P(Y = y \mid r, p) = \binom{r + y - 1}{y} p^r (1-p)^{y}\), for y = 0, 1, 2, …
E[Y] = \(\frac{r(1-p)}{p}\)
Var[Y] = \(\frac{r(1-p)}{p^2}\)
Note that the geometric distribution is a special case of the negative binomial distribution where r = 1. Because 0 < p < 1, we have E[Y] < Var[Y]. This makes the negative binomial a popular alternative to the Poisson when modeling counts with high variance (recall that the mean equals the variance for Poisson-distributed variables).
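A short simulation illustrates the overdispersion point; r, p, the seed, and the sample size are arbitrary illustration values (note that R’s rnbinom() counts failures before the r-th success, matching the definition above):

```r
set.seed(1)
m <- 100000; r <- 3; p <- 0.4  # assumed example values
y <- rnbinom(m, size = r, prob = p)
mean(y)  # close to r*(1-p)/p = 4.5
var(y)   # close to r*(1-p)/p^2 = 11.25, noticeably larger than the mean
```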
Another generalization of the Bernoulli and the binomial is the multinomial distribution, which is like a binomial when there are more than two possible outcomes. Suppose we have n trials and there are k different possible outcomes, which occur with probabilities \(p_1, p_2, \ldots, p_k\). For example, suppose we are rolling a six-sided die that might be loaded so that the sides are not equally likely. Then n is the total number of rolls, k = 6, \(p_1\) is the probability of rolling a one, and we denote by \(x_1, x_2, \ldots, x_6\) a possible outcome for the number of times we observe each of the sides one through six, where \(\sum_{i=1}^{6}x_i = n\) and \(\sum_{i=1}^{6}p_i = 1\).
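The loaded-die example can be simulated with R’s rmultinom(); the specific probabilities below are an assumed illustration, with side six twice as likely as the others:

```r
set.seed(7)
p <- c(1, 1, 1, 1, 1, 2) / 7           # assumed loaded-die probabilities
n <- 60                                # number of rolls
x <- rmultinom(1, size = n, prob = p)  # counts for sides 1 through 6
x
sum(x)  # the counts always sum to n
```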
Bayesian statistics is based on the homonymous Bayes’ theorem (or rule), named after Thomas Bayes, a British reverend of the 1740s. His primary field of study was theology, but Bayes was also an “amateur” mathematician. During his studies in Edinburgh he was influenced by the philosopher David Hume, who proposed that we can only rely on what we learn from experience. Probability as a mathematical field was only just emerging at the time: it could solve simple problems such as the probability of observing an effect given a cause, but not the inverse, P(cause | effect). Bayes gave a simple example of tossing balls onto a table and recording where they stop (toward the left or the right side of the table), noting that the more balls are thrown, the better we may infer whether the tossing is biased to one side. This is nowadays called a learning process, and although it was a remarkable finding, Bayes left it forgotten in a drawer (!) until his death. Richard Price found it and, after studying Bayes’s papers for two years and making some corrections, finally published “An Essay towards solving a Problem in the Doctrine of Chances” in 1763.
Still, the theorem was only an example and did not yet have its modern form. Even after this publication, hardly anyone continued the development except Laplace who, while trying to solve an astronomical problem, studied Price’s paper and developed a first version of what we now call Bayes’ theorem. The reception of Laplace’s proposal was somewhat hostile, owing to inherent challenges such as the assumption of equal prior probabilities, the subjectivity of priors, and serious technical and computational problems in practice, which remain an issue even today.
Bayes’ theorem calculates the probability of an event given prior knowledge of conditions that might be related to the event. It is stated mathematically as the following equation (Efron, 2013):
\({P(A\mid B)={\frac {P(B\mid A)\,P(A)}{P(B)}}}\)
where A and B are events and P(B) \(\neq\) 0.
This is the basis of Bayesian inference, a particular approach to statistical inference with its own interpretation of probability. When applied, the probabilities involved in Bayes’ theorem may carry different probability interpretations. Under the Bayesian interpretation, the theorem expresses how a subjective degree of belief should rationally change to account for the availability of related evidence.
Before we learn how to simulate from complicated posterior distributions, let’s review some of the basics of Monte Carlo estimation. Monte Carlo estimation refers to simulating hypothetical draws from a probability distribution in order to calculate important quantities. These quantities might include the mean, the variance, the probability of some event, or quantiles of the distribution. All of these calculations involve integration, which except for the simplest distributions, can be very difficult or impossible.
Suppose we have a random variable \(\theta\) that follows a Gamma(a, b) distribution. Let’s say a = 2 and b = 1/3, where b is the rate parameter. To calculate the mean of this distribution, we would need to compute the following integral
\[\text{E}(\theta) = \int_0^\infty \theta f(\theta) d\theta = \int_0^\infty \theta \frac{b^a}{\Gamma(a)}\theta^{a-1}e^{-b\theta} d\theta \,\]
It is possible to compute this integral, and the answer is a/b (6 in this case). However, we could verify this answer through Monte Carlo estimation. To do so, we would simulate a large number of draws (call them \(\theta_i^*\) for i = 1,2,…,m) from this gamma distribution and calculate their sample mean. Why can we do this? Recall from the previous course that if we have a random sample from a distribution, the average of those samples converges in probability to the true mean of that distribution by the Law of Large Numbers. Furthermore, by the Central Limit Theorem (CLT), this sample mean \(\bar \theta^* = \frac{1}{m} \sum_{i=1}^{m} \theta_i^*\) approximately follows a normal distribution with mean E(\(\theta\)) and variance Var(\(\theta\))/m. The theoretical variance of \(\theta\) is the following integral: \[\text{Var}(\theta) = \int_0^\infty (\theta-E(\theta))^2 f(\theta) d\theta \,\]
Just like we did with the mean, we could approximate this variance with the sample variance \(\frac{1}{m}\sum_{i=1}^m (\theta_i^* - \bar{\theta}^*)^2\)
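These ideas translate directly into a few lines of R for the Gamma(2, 1/3) example; the Monte Carlo sample size m and the seed are arbitrary:

```r
set.seed(32)
m <- 100000               # Monte Carlo sample size (assumed)
a <- 2.0; b <- 1.0 / 3.0  # shape and rate from the text
theta <- rgamma(m, shape = a, rate = b)
mean(theta)  # close to the true mean a/b = 6
var(theta)   # close to the true variance a/b^2 = 18
```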
This method can be used to calculate many different integrals. Say \(h(\theta)\) is any function and we want to calculate \(\int h(\theta) p(\theta) d\theta\). This integral is precisely what is meant by \(E(h(\theta))\), so we can conveniently approximate it by taking the sample mean of \(h(\theta_i^*)\). That is, we apply the function h to each simulated sample from the distribution and take the average of all the results.
One extremely useful example of such an h function is the indicator \(I_A(\theta)\), where A is some logical condition about the value of \(\theta\). To demonstrate, suppose \(h(\theta) = I_{[\theta<5]}(\theta)\), which gives 1 if \(\theta<5\) and 0 otherwise.
Then \(E(h(\theta)) = \int_0^\infty I_{[\theta<5]}(\theta) p(\theta) d\theta = \int_0^5 1 \cdot p(\theta) d\theta + \int_5^\infty 0 \cdot p(\theta) d\theta = P(\theta < 5) \,\). This means we can approximate the probability that \(\theta < 5\) by drawing many samples \(\theta_i^*\) and approximating this integral with \(\frac{1}{m} \sum_{i=1}^m I_{[\theta<5]}(\theta_i^*)\). This expression simply counts how many of the samples come out to be less than 5 and divides by the total number of simulated samples. So simple!
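Continuing the Gamma(2, 1/3) example, the indicator trick looks like this in R; pgamma() gives the exact probability for comparison:

```r
set.seed(32)
theta <- rgamma(100000, shape = 2, rate = 1/3)
mean(theta < 5)                   # Monte Carlo estimate of P(theta < 5)
pgamma(5, shape = 2, rate = 1/3)  # exact probability, for comparison
```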
Likewise, we can approximate quantiles of a distribution. If we are looking for the value z such that \(P(\theta < z) = 0.9\), we simply arrange the samples \(\theta_i^*\) in ascending order and find the smallest drawn value that is greater than 90% of the others.
How good is an approximation by Monte Carlo sampling? Again we can turn to the CLT, which tells us that the variance of our estimate is controlled in part by m . For a better estimate, we want larger m .
For example, if we seek E(\(\theta\)), then the sample mean \(\bar \theta^*\) approximately follows a normal distribution with mean E(\(\theta\)) and variance Var(\(\theta\))/m. The variance tells us how far our estimate might be from the true value. One way to approximate Var(\(\theta\)) is to replace it with the sample variance. The standard deviation of our Monte Carlo estimate is the square root of that, or the sample standard deviation divided by \(\sqrt m\). If m is large, it is reasonable to assume that the true value will likely be within about two standard deviations of your Monte Carlo estimate.
We can also obtain Monte Carlo samples from hierarchical models. As a simple example, let’s consider a binomial random variable where \(y \mid \phi \sim \text{Bin}(10, \phi)\), and further suppose \(\phi\) is random (as if it had a prior) with \(\phi \sim \text{Beta}(2,2)\). Given any hierarchical model, we can always write the joint distribution of y and \(\phi\) as \(p(y, \phi) = p(y \mid \phi)p(\phi)\) using the chain rule of probability. To simulate from this joint distribution, repeat these two steps for a large number of iterations m: first draw \(\phi_i^* \sim \text{Beta}(2,2)\), then, given \(\phi_i^*\), draw \(y_i^* \sim \text{Bin}(10, \phi_i^*)\).
This will produce m independent pairs \((y^*, \phi^*)_i\) drawn from their joint distribution. One major advantage of Monte Carlo simulation is that marginalizing is easy. Calculating the marginal distribution of y, \(p(y) = \int_0^1 p(y, \phi) d\phi\), might be challenging. But if we have draws from the joint distribution, we can simply discard the \(\phi_i^*\) draws and use the \(y_i^*\) as samples from their marginal distribution. This is also called the prior predictive distribution, introduced in the previous course.
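A minimal R sketch of this hierarchical simulation (the seed and m are assumed):

```r
set.seed(11)
m <- 100000
phi <- rbeta(m, 2, 2)                  # step 1: draw phi* from Beta(2, 2)
y <- rbinom(m, size = 10, prob = phi)  # step 2: draw y* given each phi*
mean(y)       # marginal mean of y; E[y] = 10 * E[phi] = 5
table(y) / m  # Monte Carlo estimate of the marginal (prior predictive) pmf
```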
In the next segment, we will demonstrate some of these principles. Remember, we do not yet know how to sample from the complicated posterior distributions introduced in the previous lesson. But once we learn that, we will be able to use the principles from this lesson to make approximate inferences from those posterior distributions.
Definition. If we have a sequence of random variables \(X_1, X_2, \ldots, X_n\), where the indices 1, 2, …, n represent successive points in time, we can use the chain rule of probability to calculate the probability of the entire sequence:
\(p(X_1, X_2, \ldots, X_n) = p(X_1) \cdot p(X_2 \mid X_1) \cdot p(X_3 \mid X_2, X_1) \cdot \ldots \cdot p(X_n \mid X_{n-1}, \ldots, X_2, X_1) \,\)
Markov chains simplify this expression by using the Markov assumption. The assumption is that given the entire past history, the probability distribution for the random variable at the next time step only depends on the current variable. Mathematically, the assumption is written like this:
\(p(X_{t+1} | X_t, X_{t-1}, \ldots, X_2, X_1 ) = p(X_{t+1} | X_t) \,\)
for all t = 1, 2, …, n-1. Under this assumption, we can write the first expression as \(p(X_1, X_2, \ldots, X_n) = p(X_1) \cdot p(X_2 | X_1) \cdot p(X_3 | X_2) \cdot p(X_4 | X_3) \cdot \ldots \cdot p(X_n | X_{n-1}) \, ,\)
which is much simpler than the original. It consists of an initial distribution for the first variable, \(p(X_1)\), and n - 1 transition probabilities. We usually make one more assumption: that the transition probabilities do not change with time. Hence, the transition from time t to time t+1 depends only on the value of \(X_t\).
Suppose you have a secret number (an integer) between 1 and 5. We will call it your initial number at step 1. At each subsequent time step, your secret number changes according to the following rule: flip a coin; if it lands heads, add 1 to your secret number, and if tails, subtract 1, wrapping around so that 5 + 1 becomes 1 and 1 - 1 becomes 5.
Before the experiment, we can think of the sequence of secret numbers as a sequence of random variables, each taking on a value in {1,2,3,4,5}. Assume that the coin is fair, so that with each flip, the probability of heads and tails are both 0.5.
Does this game qualify as a true Markov chain? Suppose your secret number is currently 4 and that the history of your secret numbers is (2,1,2,3). What is the probability that on the next step, your secret number will be 5? What about the other four possibilities? Because of the rules of this game, the probability of the next transition will depend only on the fact that your current number is 4. The numbers further back in your history are irrelevant, so this is a Markov chain.
This is an example of a discrete Markov chain, where the possible values of the random variables come from a discrete set. Those possible values (secret numbers in this example) are called states of the chain. The states are usually numbers, as in this example, but they can represent anything. In one common example, the states describe the weather on a particular day, which could be labeled as 1-fair, 2-poor.
Now let’s look at a continuous example of a Markov chain. Say \(X_0 = 0\) and we have the following transition model: \(p(X_{t+1} | X_t=x_t) = \text{N}(x_t, 1)\). That is, the probability distribution for the next state is normal with variance 1 and mean equal to the current state. This is often referred to as a “random walk.” Clearly, it is a Markov chain because the transition to the next state \(X_{t+1}\) only depends on the current state \(X_t\).
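A minimal R sketch of this random walk (the chain length, seed, and plot are assumed choices):

```r
set.seed(34)
n <- 100
x <- numeric(n)
x[1] <- 0.0  # start the chain at 0
for (t in 2:n) {
  x[t] <- rnorm(1, mean = x[t - 1], sd = 1.0)  # next state ~ N(current, 1)
}
plot(x, type = "l", xlab = "t", main = "Random walk")
```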
Let’s return to our example of the discrete Markov chain. If we assume that transition probabilities do not change with time, then there are a total of \(5^2 = 25\) potential transition probabilities. Potential transition probabilities would be from State 1 to State 2, State 1 to State 3, and so forth. These transition probabilities can be arranged into a matrix Q:
\[Q = \begin{pmatrix} 0 & .5 & 0 & 0 & .5 \\ .5 & 0 & .5 & 0 & 0 \\ 0 & .5 & 0 & .5 & 0 \\ 0 & 0 & .5 & 0 & .5 \\ .5 & 0 & 0 & .5 & 0 \\ \end{pmatrix}\]
where the transitions from State 1 are in the first row, the transitions from State 2 are in the second row, etc. For example, the probability \(p(X_{t+1} = 5 \mid X_t = 4)\) can be found in the fourth row, fifth column.
The transition matrix is especially useful if we want to find the probabilities associated with multiple steps of the chain. For example, we might want to know \(p(X_{t+2}=3 \mid X_t=1)\) , the probability of your secret number being 3 two steps from now, given that your number is currently 1. We can calculate this as \(\sum_{k=1}^5 p(X_{t+2}=3 \mid X_{t+1}=k) \cdot p(X_{t+1}=k \mid X_t=1)\) , which conveniently is found in the first row and third column of \(Q^2\).
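A sketch of the matrix computation just described; Q is typed in from the text and squared with R’s matrix product:

```r
# Transition matrix of the secret-number game (row i = transitions from state i)
Q <- matrix(c(0.0, 0.5, 0.0, 0.0, 0.5,
              0.5, 0.0, 0.5, 0.0, 0.0,
              0.0, 0.5, 0.0, 0.5, 0.0,
              0.0, 0.0, 0.5, 0.0, 0.5,
              0.5, 0.0, 0.0, 0.5, 0.0),
            nrow = 5, byrow = TRUE)
Q %*% Q          # two-step transition probabilities
(Q %*% Q)[1, 3]  # P(X_{t+2} = 3 | X_t = 1) = 0.25
```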
[,1] [,2] [,3] [,4] [,5]
[1,] 0.50 0.00 0.25 0.25 0.00
[2,] 0.00 0.50 0.00 0.25 0.25
[3,] 0.25 0.00 0.50 0.00 0.25
[4,] 0.25 0.25 0.00 0.50 0.00
[5,] 0.00 0.25 0.25 0.00 0.50
Suppose we want to know the probability distribution of your secret number in the distant future, say \(p(X_{t+h} | X_t)\) where h is a large number. Let’s calculate this for a few different values of h.
[,1] [,2] [,3] [,4] [,5]
[1,] 0.062 0.312 0.156 0.156 0.312
[2,] 0.312 0.062 0.312 0.156 0.156
[3,] 0.156 0.312 0.062 0.312 0.156
[4,] 0.156 0.156 0.312 0.062 0.312
[5,] 0.312 0.156 0.156 0.312 0.062
[,1] [,2] [,3] [,4] [,5]
[1,] 0.248 0.161 0.215 0.215 0.161
[2,] 0.161 0.248 0.161 0.215 0.215
[3,] 0.215 0.161 0.248 0.161 0.215
[4,] 0.215 0.215 0.161 0.248 0.161
[5,] 0.161 0.215 0.215 0.161 0.248
[,1] [,2] [,3] [,4] [,5]
[1,] 0.201 0.199 0.200 0.200 0.199
[2,] 0.199 0.201 0.199 0.200 0.200
[3,] 0.200 0.199 0.201 0.199 0.200
[4,] 0.200 0.200 0.199 0.201 0.199
[5,] 0.199 0.200 0.200 0.199 0.201
Notice that as the future horizon gets more distant, the transition distributions appear to converge. The state you are currently in becomes less important in determining the more distant future. If we let h get really large, and take it to the limit, all the rows of the long-range transition matrix will become equal to (.2, .2, .2, .2, .2). That is, if you run the Markov chain for a very long time, the probability that you will end up in any particular state is 1/5 = 0.2 for each of the five states. These long-range probabilities are equal to what is called the stationary distribution of the Markov chain.
The stationary distribution of a chain is the initial state distribution for which performing a transition will not change the probability of ending up in any given state. That is, multiplying the stationary distribution by Q returns the same distribution:
[,1] [,2] [,3] [,4] [,5]
[1,] 0.2 0.2 0.2 0.2 0.2
One consequence of this property is that once a chain reaches its stationary distribution, the stationary distribution will remain the distribution of the states thereafter.
We can also demonstrate the stationary distribution by simulating a long chain from this example.
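A sketch of such a simulation; Q is repeated so the snippet stands alone, and the chain length and seed are assumed:

```r
set.seed(38)
Q <- matrix(c(0.0, 0.5, 0.0, 0.0, 0.5,
              0.5, 0.0, 0.5, 0.0, 0.0,
              0.0, 0.5, 0.0, 0.5, 0.0,
              0.0, 0.0, 0.5, 0.0, 0.5,
              0.5, 0.0, 0.0, 0.5, 0.0),
            nrow = 5, byrow = TRUE)
n <- 5000
x <- numeric(n)
x[1] <- 1  # arbitrary starting state
for (i in 2:n) {
  x[i] <- sample.int(5, size = 1, prob = Q[x[i - 1], ])  # step using row of Q
}
table(x) / n  # visit frequencies, near (.2, .2, .2, .2, .2)
```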
Now that we have simulated the chain, let’s look at the distribution of visits to the five states.
x
1 2 3 4 5
0.1996 0.2020 0.1980 0.1994 0.2010
The overall distribution of the visits to the states is approximately equal to the stationary distribution.
As we have just seen, if you simulate a Markov chain for many iterations, the samples can be used as a Monte Carlo sample from the stationary distribution. This is exactly how we are going to use Markov chains for Bayesian inference. In order to simulate from a complicated posterior distribution, we will set up and run a Markov chain whose stationary distribution is the posterior distribution.
It is important to note that the stationary distribution doesn’t always exist for any given Markov chain. The Markov chain must have certain properties, which we won’t discuss here. However, the Markov chain algorithms we’ll use in future lessons for Monte Carlo estimation are guaranteed to produce stationary distributions.
The continuous random walk example we gave earlier does not have a stationary distribution. However, we can modify it so that it does have a stationary distribution.
Let the transition distribution be \(p(X_{t+1} | X_t=x_t) = \text{N}(\phi x_t, 1)\), where \(-1 < \phi < 1\). That is, the probability distribution for the next state is normal with variance 1 and mean equal to \(\phi\) times the current state. As long as \(\phi\) is between -1 and 1, the stationary distribution will exist for this model.
Let’s simulate this chain for \(\phi = -0.6\).
The theoretical stationary distribution for this chain is normal with mean 0 and variance 1/(1-\(\phi^2\)) which in our example approximately equals 1.562. Let’s look at a histogram of our chain and compare that with the theoretical stationary distribution.
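A sketch of this simulation and comparison (the chain length and seed are assumed):

```r
set.seed(38)
n <- 10000
phi <- -0.6
x <- numeric(n)
x[1] <- 0.0
for (t in 2:n) {
  x[t] <- rnorm(1, mean = phi * x[t - 1], sd = 1.0)  # next state ~ N(phi * current, 1)
}
hist(x, freq = FALSE, breaks = 50)
curve(dnorm(x, mean = 0, sd = sqrt(1 / (1 - phi^2))), add = TRUE, col = "red")
var(x)  # close to the theoretical 1/(1 - 0.36) = 1.5625
```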
It appears that the chain has reached the stationary distribution. Therefore, we could treat this simulation from the chain like a Monte Carlo sample from the stationary distribution, a normal with mean 0 and variance 1.562.
Because most posterior distributions we will look at are continuous, our Monte Carlo simulations with Markov chains will be similar to this example.
Metropolis-Hastings is an algorithm that allows us to sample from a generic probability distribution (which we will call the target distribution), even if we do not know the normalizing constant. To do this, we construct and sample from a Markov chain whose stationary distribution is the target distribution. It consists of picking an arbitrary starting value, and iteratively accepting or rejecting candidate samples drawn from another distribution, one that is easy to sample.
Let’s say we wish to produce samples from a target distribution \(p(\theta) \propto g(\theta)\) where we don’t know the normalizing constant (since \(\int g(\theta) d\theta\) is hard or impossible to compute), so we only have \(g(\theta)\) to work with. The Metropolis-Hastings algorithm proceeds as follows.
1. Select an initial value \(\theta_0\).
2. For i = 1, …, m, repeat:
a. Draw a candidate \(\theta^* \sim q(\theta^* \mid \theta_{i-1})\) from a proposal distribution \(q\).
b. Compute the ratio \(\alpha = \frac{g(\theta^*) \, q(\theta_{i-1} \mid \theta^*)}{g(\theta_{i-1}) \, q(\theta^* \mid \theta_{i-1})}\).
c. If \(\alpha \ge 1\), set \(\theta_i = \theta^*\). If \(\alpha < 1\), set \(\theta_i = \theta^*\) with probability \(\alpha\), or \(\theta_i = \theta_{i-1}\) with probability \(1-\alpha\).
Steps 2b and 2c act as a correction since the proposal distribution is not the target distribution. At each step in the chain, we draw a candidate and decide whether to “move” the chain there or remain where we are. If the proposed move to the candidate is “advantageous,” (\(\alpha \ge 1\)) we “move” there and if it is not “advantageous,” we still might move there, but only with probability \(\alpha\). Since our decision to “move” to the candidate only depends on where the chain currently is, this is a Markov chain.
One careful choice we must make is the candidate-generating distribution \(q(\theta^* \mid \theta_{i-1})\). It may or may not depend on the previous iteration’s value of \(\theta\). One example where it doesn’t depend on the previous value would be if \(q(\theta)\) is always the same distribution. If we use this option, \(q(\theta)\) should be as similar as possible to \(p(\theta)\).
Another popular option, one that does depend on the previous iteration, is Random-Walk Metropolis-Hastings. Here, the proposal distribution is centered on \(\theta_{i-1}\). For instance, it might be a normal distribution with mean \(\theta_{i-1}\). Because the normal distribution is symmetric, this example comes with another advantage: \(q(\theta^* \mid \theta_{i-1}) = q(\theta_{i-1} \mid \theta^*)\), causing it to cancel out when we calculate \(\alpha\). Thus, in Random-Walk Metropolis-Hastings where the candidate is drawn from a normal with mean \(\theta_{i-1}\) and constant variance, the acceptance ratio is \(\alpha = \frac{g(\theta^*) }{g(\theta_{i-1})}\).
Clearly, not all candidate draws are accepted, so our Markov chain sometimes “stays” where it is, possibly for many iterations. How often you want the chain to accept candidates depends on the type of algorithm you use. If you approximate \(p(\theta)\) with \(q(\theta^*)\) and always draw candidates from that, accepting candidates often is good; it means \(q(\theta^*)\) is approximating \(p(\theta)\) well. However, you still may want q to have a larger variance than p and see some rejection of candidates as an assurance that q is covering the space well.
As we will see in coming examples, a high acceptance rate for the Random-Walk Metropolis-Hastings sampler is not a good thing. If the random walk is taking too small of steps, it will accept often, but will take a very long time to fully explore the posterior. If the random walk is taking too large of steps, many of its proposals will have low probability and the acceptance rate will be low, wasting many draws. Ideally, a random walk sampler should accept somewhere between 23% and 50% of the candidates proposed.
In the next segment, we will see a demonstration of this algorithm used in a discrete case, where we can show mathematically that the Markov chain converges to the target distribution. In the following segment, we will demonstrate coding a Random-Walk Metropolis-Hastings algorithm in R to solve one of the problems from the end of Lesson 2.
Recall the model from the last segment of Lesson 2, where the data are the percent change in total personnel from last year to this year for n = 10 companies. We used a normal likelihood with known variance and a t distribution for the prior on the unknown mean. Suppose the values are y = (1.2, 1.4, -0.5, 0.3, 0.9, 2.3, 1.0, 0.1, 1.3, 1.9). Because this model is not conjugate, the posterior distribution does not have a standard form that we can easily sample. To obtain posterior samples, we will set up a Markov chain whose stationary distribution is this posterior distribution.
Recall that the posterior distribution is:
\[p(\mu \mid y_1, \ldots, y_n) \propto \frac{\exp[ n ( \bar{y} \mu - \mu^2/2)]}{1 + \mu^2}\] The posterior distribution on the left is our target distribution and the expression on the right is our g(\(\mu\)).
The first thing we can do in R is write a function to evaluate \(g(\mu)\). Because posterior distributions include likelihoods (the product of many numbers that are potentially small), \(g(\mu)\) might evaluate to such a small number that, to the computer, it is effectively zero. This would cause a problem when we evaluate the acceptance ratio \(\alpha\). To avoid it, we will work on the log scale, which is more numerically stable. Thus, we will write a function to evaluate
\[\log(g(\mu)) = n ( \bar{y} \mu - \mu^2/2) - \log(1 + \mu^2)\]
This function will require three arguments, \(\mu\) , \(\bar{y}\) and n.
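A sketch of this function (the R name lg and the argument names are assumptions):

```r
# log of g(mu): n*(ybar*mu - mu^2/2) - log(1 + mu^2)
lg <- function(mu, ybar, n) {
  mu2 <- mu^2
  n * (ybar * mu - mu2 / 2.0) - log(1.0 + mu2)
}
lg(mu = 0.0, ybar = 1.0, n = 10)  # log g(0) = -log(1) = 0
```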
Next, let’s write a function to execute the Random-Walk Metropolis-Hastings sampler with normal proposals.
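A sketch of such a sampler; lg() is repeated as a one-liner so the snippet stands alone, and the function and argument names are assumptions:

```r
lg <- function(mu, ybar, n) n * (ybar * mu - mu^2 / 2.0) - log(1.0 + mu^2)

mh <- function(n, ybar, n_iter, mu_init, cand_sd) {
  mu_out <- numeric(n_iter)
  accpt <- 0
  mu_now <- mu_init
  lg_now <- lg(mu = mu_now, ybar = ybar, n = n)
  for (i in 1:n_iter) {
    mu_cand <- rnorm(1, mean = mu_now, sd = cand_sd)  # random-walk proposal
    lg_cand <- lg(mu = mu_cand, ybar = ybar, n = n)
    lalpha <- lg_cand - lg_now        # log acceptance ratio (symmetric q cancels)
    if (log(runif(1)) < lalpha) {     # accept with probability min(1, alpha)
      mu_now <- mu_cand
      lg_now <- lg_cand
      accpt <- accpt + 1
    }
    mu_out[i] <- mu_now
  }
  list(mu = mu_out, accpt = accpt / n_iter)
}
```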
Now, let’s set up the problem.
Finally, we’re ready to run the sampler! Let’s use m=1000 iterations and proposal standard deviation (which controls the proposal step size) 3.0, and initial value at the prior median 0
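A sketch of the full run, repeating the helper functions so the snippet stands alone; the seed is an assumption, so the exact numbers in any particular printed output may differ:

```r
lg <- function(mu, ybar, n) n * (ybar * mu - mu^2 / 2.0) - log(1.0 + mu^2)

mh <- function(n, ybar, n_iter, mu_init, cand_sd) {
  # Random-Walk Metropolis-Hastings on the log scale
  mu_out <- numeric(n_iter); accpt <- 0
  mu_now <- mu_init; lg_now <- lg(mu_now, ybar, n)
  for (i in 1:n_iter) {
    mu_cand <- rnorm(1, mean = mu_now, sd = cand_sd)
    lg_cand <- lg(mu_cand, ybar, n)
    if (log(runif(1)) < lg_cand - lg_now) {
      mu_now <- mu_cand; lg_now <- lg_cand; accpt <- accpt + 1
    }
    mu_out[i] <- mu_now
  }
  list(mu = mu_out, accpt = accpt / n_iter)
}

set.seed(43)
y <- c(1.2, 1.4, -0.5, 0.3, 0.9, 2.3, 1.0, 0.1, 1.3, 1.9)
post <- mh(n = length(y), ybar = mean(y), n_iter = 1000,
           mu_init = 0.0, cand_sd = 3.0)
str(post)
plot(post$mu, type = "l")  # trace plot of the chain
```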
List of 2
$ mu : num [1:1000] -0.113 1.507 1.507 1.507 1.507 ...
$ accpt: num 0.122
This last plot is called a trace plot. It shows the history of the chain and provides basic feedback about whether the chain has reached its stationary distribution.
It appears our proposal step size was too large (acceptance rate below 23%). Let’s try another.
[1] 0.946
[1] 0.38
Hey, that looks pretty good. Just for fun, let’s see what happens if we initialize the chain at some far-off value.
[1] 0.387
It took a while to find the stationary distribution, but it looks like we succeeded! If we discard the first 100 or so values, it appears that the rest of the samples come from the stationary distribution, our posterior distribution! Let’s plot the posterior density against the prior to see how the data updated our belief about \(\mu\).
These results are encouraging, but they are preliminary. We still need to investigate more formally whether our Markov chain has converged to the stationary distribution. We will explore this in a future lesson.
Obtaining posterior samples using the Metropolis-Hastings algorithm can be time-consuming and require some fine-tuning, as we’ve just seen. The good news is that we can rely on software to do most of the work for us. In the next couple of videos, we’ll introduce a program that will make posterior sampling easy.
So far, we have demonstrated MCMC for a single parameter. What if we seek the posterior distribution of multiple parameters, and that posterior distribution does not have a standard form? One option is to perform Metropolis-Hastings (M-H) by sampling candidates for all parameters at once, and accepting or rejecting all of those candidates together. While this is possible, it can get complicated. Another (simpler) option is to sample the parameters one at a time.
As a simple example, suppose we have a joint posterior distribution for two parameters \(\theta\) and \(\phi\), written \(p(\theta,\phi \mid y) \propto g(\theta,\phi)\). If we knew the value of \(\phi\), then we would simply draw a candidate for \(\theta\) and use \(g(\theta,\phi)\) to compute our Metropolis-Hastings ratio, possibly accepting the candidate. Of course we don’t know \(\phi\), but before moving on to the next iteration we can perform a similar update for it: draw a candidate for \(\phi\) using some proposal distribution and again use \(g(\theta,\phi)\) to compute the Metropolis-Hastings ratio, this time pretending we know the value of \(\theta\) by substituting its current iteration from the Markov chain. Once we have drawn both \(\theta\) and \(\phi\), that completes one iteration, and we begin the next iteration by drawing a new \(\theta\). In other words, we are just going back and forth, updating the parameters one at a time and plugging the current value of the other parameter into \(g(\theta,\phi)\).
This idea of one-at-a-time updates is used in what we call Gibbs sampling, which also produces a stationary Markov chain (whose stationary distribution is the posterior).
Before describing the full Gibbs sampling algorithm, there’s one more thing we can do. Using the chain rule of probability, we have \(p(\theta, \phi \mid y) = p(\theta \mid \phi, y) \cdot p(\phi \mid y)\). Notice that the only difference between \(p(\theta, \phi \mid y)\) and \(p(\theta \mid \phi, y)\) is multiplication by a factor that doesn’t involve \(\theta\). Since the \(g(\theta,\phi)\) function above, when viewed as a function of \(\theta\), is proportional to both of these expressions, we might as well have replaced it with \(p(\theta \mid \phi, y)\) in our update for \(\theta\). This distribution \(p(\theta \mid \phi, y)\) is called the full conditional distribution for \(\theta\). Why use it instead of \(g(\theta,\phi)\)? In some cases, the full conditional distribution is a standard distribution that we know how to sample. If that happens, we no longer need to draw a candidate and decide whether to accept it. In fact, if we treat the full conditional distribution as a candidate proposal distribution, the resulting Metropolis-Hastings acceptance probability is exactly 1.
Gibbs samplers require a little more work up front because you need to find the full conditional distribution for each parameter. The good news is that all full conditional distributions have the same starting point: the full joint posterior distribution. Using the example above, we have \(p(\theta \mid \phi,y) \propto p(\theta,\phi|y)\) where we simply now treat \(\phi\) as a known number. Likewise, the other full conditional is \(p(\phi \mid \theta ,y) \propto p(\theta,\phi\mid y)\) where here, we consider \(\theta\) to be a known number. We always start with the full posterior distribution. Thus, the process of finding full conditional distributions is the same as finding the posterior distribution of each parameter, pretending that all other parameters are known.
The idea of Gibbs sampling is that we can update multiple parameters by sampling just one parameter at a time, cycling through all parameters and repeating. To perform the update for one particular parameter, we substitute in the current values of all other parameters.
Here is the algorithm. Suppose we have a joint posterior distribution for two parameters \(\theta\) and \(\phi\), written \(p(\theta,\phi \mid y)\). If we can find the full conditional distribution of each parameter, i.e., \(p(\theta \mid \phi, y)\) and \(p(\phi \mid \theta, y)\), then we can take turns sampling from these distributions as follows:
1. Using \(\phi_{i-1}\), draw \(\theta_i\) from \(p(\theta \mid \phi = \phi_{i-1}, y)\).
2. Using \(\theta_i\), draw \(\phi_i\) from \(p(\phi \mid \theta = \theta_i, y)\).
Together, steps 1 and 2 complete one cycle of the Gibbs sampler and produce the draw \((\theta_i, \phi_i)\) in one iteration of an MCMC sampler. If there are more than two parameters, the same scheme applies: one Gibbs cycle includes an update for each of the parameters.
Let’s work through an example with a normal likelihood in which both the mean and the variance are unknown. The model is \[ \begin{aligned} y_i \mid \mu, \sigma^2 &\overset{\text{iid}}{\sim} \text{N} ( \mu, \sigma^2 ), \quad i=1,\ldots,n \\ \mu &\sim \text{N}(\mu_0, \sigma_0^2) \\ \sigma^2 &\sim \text{IG}(\nu_0, \beta_0) \end{aligned} \]
We chose a normal prior for \(\mu\) because, in the case where \(\sigma^2\) is known, the normal is the conjugate prior for \(\mu\). Likewise, in the case where \(\mu\) is known, the inverse-gamma is the conjugate prior for \(\sigma^2\). This will give us convenient full conditional distributions in a Gibbs sampler.
Let’s first work out the form of the full posterior distribution. When we begin analyzing data, the JAGS software will complete this step for us. However, it is extremely valuable to see and understand how this works.
\[ \begin{aligned} p( \mu, \sigma^2 \mid y_1, y_2, \ldots, y_n ) &\propto p(y_1, y_2, \ldots, y_n \mid \mu, \sigma^2) \, p(\mu) \, p(\sigma^2) \\ &= \prod_{i=1}^n \text{N} ( y_i \mid \mu, \sigma^2 ) \times \text{N}( \mu \mid \mu_0, \sigma_0^2) \times \text{IG}(\sigma^2 \mid \nu_0, \beta_0) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp \left[ -\frac{(y_i - \mu)^2}{2\sigma^2} \right] \times \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \times \frac{\beta_0^{\nu_0}}{\Gamma(\nu_0)}(\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto (\sigma^2)^{-n/2} \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] (\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \end{aligned} \]
From here, it is easy to continue on to find the two full conditional distributions we need. First let’s look at \(\mu\), assuming \(\sigma^2\) is known (in which case it becomes a constant and is absorbed into the normalizing constant): \[\begin{aligned} p(\mu \mid \sigma^2, y_1, \ldots, y_n) &\propto p( \mu, \sigma^2 \mid y_1, \ldots, y_n ) \\ &\propto \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] \exp \left[ -\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \right] \\ &= \exp \left[ -\frac{1}{2} \left( \frac{ \sum_{i=1}^n (y_i - \mu)^2}{\sigma^2} + \frac{(\mu - \mu_0)^2}{\sigma_0^2} \right) \right] \\ &\propto \text{N} \left( \mu \mid \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}, \, \frac{1}{n/\sigma^2 + 1/\sigma_0^2} \right) \, , \end{aligned}\] which we derived in the supplementary material of the last course. So, given the data and \(\sigma^2\), \(\mu\) follows this normal distribution.
Now let’s look at \(\sigma^2\), assuming \(\mu\) is known: \[\begin{aligned} p(\sigma^2 \mid \mu, y_1, \ldots, y_n) &\propto p( \mu, \sigma^2 \mid y_1, \ldots, y_n ) \\ &\propto (\sigma^2)^{-n/2} \exp \left[ -\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2} \right] (\sigma^2)^{-(\nu_0 + 1)} \exp \left[ -\frac{\beta_0}{\sigma^2} \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto (\sigma^2)^{-(\nu_0 + n/2 + 1)} \exp \left[ -\frac{1}{\sigma^2} \left( \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \right] I_{\sigma^2 > 0}(\sigma^2) \\ &\propto \text{IG}\left( \sigma^2 \mid \nu_0 + \frac{n}{2}, \, \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right) \, . \end{aligned}\]
These two distributions provide the basis of a Gibbs sampler to simulate from a Markov chain whose stationary distribution is the full posterior of both \(\mu\) and \(\sigma^2\). We simply alternate draws between these two parameters, using the most recent draw of one parameter to update the other.
We will do this in R in the next segment.
To implement the Gibbs sampler we just described, let’s return to our running example where the data are the percent change in total personnel from last year to this year for \(n=10\) companies. We’ll still use a normal likelihood, but now we’ll relax the assumption that we know the variance of growth between companies, \(\sigma^2\), and estimate it instead. In place of the t prior from earlier, we will use the conditionally conjugate priors: normal for \(\mu\) and inverse-gamma for \(\sigma^2\).
The first step will be to write functions that simulate from the full conditional distributions we derived in the previous segment. The full conditional for \(\mu\), given \(\sigma^2\) and the data, is:
\(\text{N} \left( \mu \mid \frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2}, \, \frac{1}{n/\sigma^2 + 1/\sigma_0^2} \right)\)
The full conditional for \(\sigma^2\) given \(\mu\) and data is:
\(\text{IG}\left( \sigma^2 \mid \nu_0 + \frac{n}{2}, \, \beta_0 + \frac{\sum_{i=1}^n (y_i - \mu)^2}{2} \right)\)
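These two full conditionals translate directly into short R functions. A minimal sketch; the function names and hyperparameter argument names (`mu_0`, `sig2_0`, `nu_0`, `beta_0`) are our own choices, not fixed by the text:

```r
# Draw mu from its full conditional, N(mu_1, sig2_1), given sig2 and the data
update_mu <- function(n, ybar, sig2, mu_0, sig2_0) {
  sig2_1 <- 1.0 / (n / sig2 + 1.0 / sig2_0)          # full conditional variance
  mu_1 <- sig2_1 * (n * ybar / sig2 + mu_0 / sig2_0) # full conditional mean
  rnorm(n = 1, mean = mu_1, sd = sqrt(sig2_1))
}

# Draw sig2 from its full conditional, IG(nu_1, beta_1), given mu and the data
update_sig2 <- function(n, y, mu, nu_0, beta_0) {
  nu_1 <- nu_0 + n / 2.0
  beta_1 <- beta_0 + sum((y - mu)^2) / 2.0
  1.0 / rgamma(n = 1, shape = nu_1, rate = beta_1) # reciprocal of a gamma draw is inverse-gamma
}
```

Note that `update_sig2` exploits the fact that if \(X \sim \text{Gamma}(a, b)\), then \(1/X \sim \text{IG}(a, b)\).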
With functions for drawing from the full conditionals, we are ready to write a function to perform Gibbs sampling.
Now we are ready to set up the problem in \(\text{R}\).
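Putting the pieces together, the Gibbs sampler is just a loop that alternates the two full conditional draws. The sketch below uses hypothetical stand-in data (the ten actual growth values are not reproduced in this excerpt) and illustrative prior values, so its draws will not match the output shown after it:

```r
gibbs <- function(y, n_iter, init_mu, prior) {
  ybar <- mean(y); n <- length(y)
  mu_out <- numeric(n_iter); sig2_out <- numeric(n_iter)
  mu_now <- init_mu
  for (i in 1:n_iter) {
    # 1) draw sig2 from its full conditional given the current mu
    nu_1 <- prior$nu_0 + n / 2.0
    beta_1 <- prior$beta_0 + sum((y - mu_now)^2) / 2.0
    sig2_now <- 1.0 / rgamma(1, shape = nu_1, rate = beta_1)
    # 2) draw mu from its full conditional given the new sig2
    sig2_1 <- 1.0 / (n / sig2_now + 1.0 / prior$sig2_0)
    mu_1 <- sig2_1 * (n * ybar / sig2_now + prior$mu_0 / prior$sig2_0)
    mu_now <- rnorm(1, mean = mu_1, sd = sqrt(sig2_1))
    mu_out[i] <- mu_now; sig2_out[i] <- sig2_now
  }
  cbind(mu = mu_out, sig2 = sig2_out)
}

set.seed(53)
y <- c(1.2, 1.4, -0.5, 0.3, 0.9, 2.3, 1.0, 0.1, 1.3, 1.9) # hypothetical data
prior <- list(mu_0 = 0.0, sig2_0 = 1.0, nu_0 = 1.0, beta_0 = 1.0)
post <- gibbs(y = y, n_iter = 1000, init_mu = 0.0, prior = prior)
head(post)
```

The first few rows of the resulting draws have the same layout as the output shown next.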
mu sig2
[1,] 0.3746992 1.5179144
[2,] 0.4900277 0.8532821
[3,] 0.2536817 1.4325174
[4,] 1.1378504 1.2337821
[5,] 1.0016641 0.8409815
[6,] 1.1576873 0.7926196
Iterations = 1:1000
Thinning interval = 1
Number of chains = 1
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 0.9051 0.2868 0.00907 0.00907
sig2 0.9282 0.5177 0.01637 0.01810
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 0.3024 0.7244 0.9089 1.090 1.481
sig2 0.3577 0.6084 0.8188 1.094 2.141
As with the Metropolis-Hastings example, these chains appear to have converged. In the next lesson, we will discuss convergence in more detail.
As an example of a one-way ANOVA, we’ll look at the PlantGrowth data in R.
weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
5 4.50 ctrl
6 4.61 ctrl
Because the explanatory variable group is a factor and not continuous, we choose to visualize the data with box plots rather than scatter plots.
The box plots summarize the distribution of the data for each of the three groups. It appears that treatment 2 has the highest mean yield. It might be questionable whether each group has the same variance, but we’ll assume that is the case.
Modeling
Again, we can start with the reference analysis (with a noninformative prior) using a linear model in R.
Call:
lm(formula = weight ~ group, data = PlantGrowth)
Residuals:
Min 1Q Median 3Q Max
-1.0710 -0.4180 -0.0060 0.2627 1.3690
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0320 0.1971 25.527 <2e-16 ***
grouptrt1 -0.3710 0.2788 -1.331 0.1944
grouptrt2 0.4940 0.2788 1.772 0.0877 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641, Adjusted R-squared: 0.2096
F-statistic: 4.846 on 2 and 27 DF, p-value: 0.01591
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
group 2 3.7663 1.8832 4.8461 0.01591 *
Residuals 27 10.4921 0.3886
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The default model structure in R is the linear model with dummy indicator variables. Hence, the “intercept” in this model is the mean yield for the control group. The two other parameters are the estimated effects of treatments 1 and 2. To recover the mean yield in treatment group 1, you would add the intercept term and the treatment 1 effect. To see how R sets the model up, call model.matrix(lmod), where lmod is the fitted model object, to extract the \(X\) matrix.
The anova() function in R compares variability of observations between the treatment groups to variability within the treatment groups to test whether all means are equal or whether at least one is different. The small p-value here suggests that the means are not all equal.
Let’s fit the cell means model in JAGS.
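In the JAGS language, the cell means model might look like the following sketch. The data names (`y` for the weights, `grp` for a numeric group index) and the prior values are illustrative assumptions; note that `dnorm` in JAGS is parameterized by the precision, not the variance.

```
model {
    for (i in 1:length(y)) {
        y[i] ~ dnorm(mu[grp[i]], prec)
    }
    for (j in 1:3) {
        mu[j] ~ dnorm(0.0, 1.0/1.0e6)   # vague normal prior on each group mean
    }
    prec ~ dgamma(5.0/2.0, 5.0*1.0/2.0) # equivalent to an inverse-gamma prior on the variance
    sig <- sqrt(1.0/prec)
}
```

Compiling a model like this with jags.model() from the rjags package produces output like that shown below.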
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 30
Unobserved stochastic nodes: 4
Total graph size: 85
Initializing model
Model checking
As usual, we check for convergence of our MCMC.
Potential scale reduction factors:
Point est. Upper C.I.
mu[1] 1 1
mu[2] 1 1
mu[3] 1 1
sig 1 1
Multivariate psrf
1
mu[1] mu[2] mu[3] sig
Lag 0 1.0000000000 1.000000000 1.000000000 1.000000e+00
Lag 1 -0.0133832476 0.015198728 -0.002811435 1.021045e-01
Lag 5 -0.0048737812 0.012869721 0.003196306 4.627230e-03
Lag 10 -0.0084556933 -0.007671575 0.011585229 -3.373542e-03
Lag 50 -0.0004832054 -0.017920090 0.008207320 -6.108501e-05
mu[1] mu[2] mu[3] sig
15211.40 14675.47 15000.00 12224.44
mu[1] mu[2] mu[3] sig
5.0324936 4.6616392 5.5258237 0.7128744
Again, it might be appropriate to have a separate variance for each group. We will have you do that as an exercise.
Results
Let’s look at the posterior summary of the parameters.
Iterations = 1001:6000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu[1] 5.0325 0.22889 0.0018689 0.0018562
mu[2] 4.6616 0.22856 0.0018662 0.0018881
mu[3] 5.5258 0.22898 0.0018696 0.0018697
sig 0.7129 0.09299 0.0007592 0.0008417
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu[1] 4.5858 4.8797 5.0330 5.1835 5.4879
mu[2] 4.2075 4.5119 4.6620 4.8156 5.1085
mu[3] 5.0695 5.3758 5.5274 5.6761 5.9787
sig 0.5605 0.6471 0.7031 0.7677 0.9225
lower upper
mu[1] 4.5838513 5.4831706
mu[2] 4.2105105 5.1103137
mu[3] 5.0650575 5.9709131
sig 0.5466879 0.9009385
attr(,"Probability")
[1] 0.95
The HPDinterval() function in the coda package calculates intervals of highest posterior density for each parameter.
We are interested in whether one of the treatments increases mean yield. It is clear that treatment 1 does not. What about treatment 2?
[1] 0.9364667
There is a high posterior probability that the mean yield for treatment 2 is greater than the mean yield for the control group.
It may be the case that treatment 2 would be costly to put into production. Suppose that to be worthwhile, this treatment must increase mean yield by 10%. What is the posterior probability that the increase is at least that?
[1] 0.4913333
We have about 50/50 odds that adopting treatment 2 would increase mean yield by at least 10%.
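Both of the probabilities above are simple Monte Carlo estimates computed from the combined posterior samples. A sketch, using synthetic draws centered at the posterior means and standard deviations reported earlier (in practice, `post` would be the actual matrix of MCMC samples):

```r
set.seed(11)
# Synthetic stand-ins for the combined draws of the control mean (mu[1])
# and the treatment 2 mean (mu[3]); in practice use the real samples
post <- cbind("mu[1]" = rnorm(15000, mean = 5.0325, sd = 0.229),
              "mu[3]" = rnorm(15000, mean = 5.5258, sd = 0.229))

# Posterior probability that treatment 2's mean yield exceeds the control's
mean(post[, "mu[3]"] > post[, "mu[1]"])

# Posterior probability that treatment 2 increases mean yield by at least 10%
mean(post[, "mu[3]"] > 1.1 * post[, "mu[1]"])
```

Each probability is just the fraction of posterior draws for which the event holds.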
Let’s explore an example with two factors. We’ll use the warpbreaks data set in R. Check the documentation for a description of the data by typing ?warpbreaks.
breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L
Again, we visualize the data with box plots.
L M H
A 9 9 9
B 9 9 9
The different groups have more similar variance if we use the logarithm of breaks. From this visualization, it looks like both factors may play a role in the number of breaks. It appears that there is a general decrease in breaks as we move from low to medium to high tension. Let’s start with a one-way model using tension only.
'data.frame': 54 obs. of 3 variables:
$ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
$ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 54
Unobserved stochastic nodes: 4
Total graph size: 134
Initializing model
Potential scale reduction factors:
Point est. Upper C.I.
mu[1] 1 1
mu[2] 1 1
mu[3] 1 1
sig 1 1
Multivariate psrf
1
mu[1] mu[2] mu[3] sig
Lag 0 1.000000000 1.000000000 1.0000000000 1.00000000
Lag 1 0.010164005 0.005704338 0.0039054294 0.06323666
Lag 5 -0.009100413 -0.003536362 -0.0021492814 0.01783904
Lag 10 -0.006955141 0.011452014 0.0071958092 -0.01385515
Lag 50 0.006191678 0.004969063 -0.0002109035 -0.01285054
mu[1] mu[2] mu[3] sig
14564.60 14769.12 15000.00 12931.35
Here are the results.
Iterations = 1001:6000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu[1] 3.4992 0.1363 0.0011127 0.0011306
mu[2] 3.2113 0.1360 0.0011103 0.0011194
mu[3] 3.0077 0.1360 0.0011103 0.0011104
sig 0.5743 0.0553 0.0004515 0.0004878
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu[1] 3.2336 3.4088 3.4979 3.5898 3.7699
mu[2] 2.9450 3.1204 3.2111 3.3009 3.4784
mu[3] 2.7411 2.9171 3.0067 3.0965 3.2758
sig 0.4784 0.5351 0.5703 0.6081 0.6953
The 95% posterior interval for the mean of group 2 (medium tension) overlaps with both the low and high groups, but the low and high groups do not have overlapping intervals. That is a pretty strong indication that the means for low and high tension are different. Let’s collect the DIC for this model and move on to the two-way model.
With two factors, one with two levels and the other with three, we have six treatment groups, which is the same situation we discussed when introducing multiple-factor ANOVA. We will first fit the additive model, which treats the two factors separately with no interaction. To get the \(X\) (design) matrix for this model, we can extract it from a linear model in R.
(Intercept) woolB tensionM tensionH
1 1 0 0 0
2 1 0 0 0
3 1 0 0 0
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
(Intercept) woolB tensionM tensionH
49 1 1 0 1
50 1 1 0 1
51 1 1 0 1
52 1 1 0 1
53 1 1 0 1
54 1 1 0 1
By default, R has chosen the mean for wool A and low tension to be the intercept. Then, there is an effect for wool B, and effects for medium tension and high tension, each associated with dummy indicator variables.
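Using this design matrix, the additive model in JAGS might be sketched as follows, with parameter names matching those that appear in the posterior summary below (int, alpha, beta, sig); the indicator variable names and prior values are our own illustrative choices:

```
model {
    for (i in 1:length(y)) {
        y[i] ~ dnorm(mu[i], prec)
        mu[i] <- int + alpha*isWoolB[i] + beta[1]*isTensionM[i] + beta[2]*isTensionH[i]
    }
    int ~ dnorm(0.0, 1.0/1.0e6)
    alpha ~ dnorm(0.0, 1.0/1.0e6)    # effect of wool B
    for (j in 1:2) {
        beta[j] ~ dnorm(0.0, 1.0/1.0e6) # effects of medium and high tension
    }
    prec ~ dgamma(1.0/2.0, 1.0/2.0)  # illustrative prior on the precision
    sig <- sqrt(1.0/prec)
}
```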
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 54
Unobserved stochastic nodes: 5
Total graph size: 257
Initializing model
Potential scale reduction factors:
Point est. Upper C.I.
mu[1] 1 1
mu[2] 1 1
mu[3] 1 1
sig 1 1
Multivariate psrf
1
mu[1] mu[2] mu[3] sig
Lag 0 1.000000000 1.000000000 1.0000000000 1.00000000
Lag 1 0.010164005 0.005704338 0.0039054294 0.06323666
Lag 5 -0.009100413 -0.003536362 -0.0021492814 0.01783904
Lag 10 -0.006955141 0.011452014 0.0071958092 -0.01385515
Lag 50 0.006191678 0.004969063 -0.0002109035 -0.01285054
mu[1] mu[2] mu[3] sig
14564.60 14769.12 15000.00 12931.35
Let’s summarize the results, collect the DIC for this model, and compare it to the first one-way model.
Iterations = 1001:6000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
alpha -0.1520 0.12556 0.0010252 0.0017520
beta[1] -0.2848 0.15249 0.0012451 0.0023917
beta[2] -0.4882 0.15177 0.0012392 0.0023706
int 3.5753 0.12424 0.0010144 0.0023896
sig 0.4551 0.04521 0.0003691 0.0004032
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
alpha -0.3990 -0.2364 -0.1521 -0.06805 0.09192
beta[1] -0.5816 -0.3869 -0.2846 -0.18258 0.01556
beta[2] -0.7882 -0.5884 -0.4871 -0.38741 -0.18963
int 3.3302 3.4936 3.5758 3.65705 3.81890
sig 0.3776 0.4233 0.4512 0.48241 0.55391
Mean deviance: 55.38
penalty 5.134
Penalized deviance: 60.51
Mean deviance: 66.74
penalty 4.146
Penalized deviance: 70.89
The DIC for the additive model is about ten units lower than for the one-way model, which suggests the wool factor does carry some information. Before settling on the additive model, however, we should consider whether there is an interaction. Let’s look again at the box plot with all six treatment groups.
Our two-way model has a single effect for wool B, and the estimate is negative. If this is accurate, we would expect wool B to be associated with fewer breaks than wool A on average. That holds for low and high tension, but breaks appear to be higher for wool B under medium tension. In other words, the effect of wool B is not consistent across tension levels, so it may be appropriate to add an interaction term. In R, this would look like:
Call:
lm(formula = log(breaks) ~ .^2, data = warpbreaks)
Residuals:
Min 1Q Median 3Q Max
-0.81504 -0.27885 0.04042 0.27319 0.64358
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.7179 0.1247 29.824 < 2e-16 ***
woolB -0.4356 0.1763 -2.471 0.01709 *
tensionM -0.6012 0.1763 -3.410 0.00133 **
tensionH -0.6003 0.1763 -3.405 0.00134 **
woolB:tensionM 0.6281 0.2493 2.519 0.01514 *
woolB:tensionH 0.2221 0.2493 0.891 0.37749
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.374 on 48 degrees of freedom
Multiple R-squared: 0.3363, Adjusted R-squared: 0.2672
F-statistic: 4.864 on 5 and 48 DF, p-value: 0.001116
Adding the interaction, we get an effect for being in wool B and medium tension, as well as for being in wool B and high tension. There are now six parameters for the mean, one for each treatment group, so this model is equivalent to the full cell means model. Let’s use that.
In this new model, \(\mu\) will be a matrix with six entries, each corresponding to a treatment group.
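A sketch of that model in the JAGS language, where the group means are indexed by wool and by tension level (the index variable names `woolGrp` and `tensGrp` and the prior values are our own assumptions):

```
model {
    for (i in 1:length(y)) {
        y[i] ~ dnorm(mu[woolGrp[i], tensGrp[i]], prec)
    }
    for (j in 1:max(woolGrp)) {
        for (k in 1:max(tensGrp)) {
            mu[j,k] ~ dnorm(0.0, 1.0/1.0e6) # one mean per treatment group
        }
    }
    prec ~ dgamma(1.0/2.0, 1.0/2.0)  # illustrative prior on the precision
    sig <- sqrt(1.0/prec)
}
```

With two wool levels and three tension levels, this yields the six cell means plus one precision, i.e., seven unobserved stochastic nodes, as reported in the graph information below.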
'data.frame': 54 obs. of 3 variables:
$ breaks : num 26 30 54 25 70 52 51 26 67 18 ...
$ wool : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
$ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 54
Unobserved stochastic nodes: 7
Total graph size: 199
Initializing model
Potential scale reduction factors:
Point est. Upper C.I.
mu[1,1] 1 1
mu[2,1] 1 1
mu[1,2] 1 1
mu[2,2] 1 1
mu[1,3] 1 1
mu[2,3] 1 1
sig 1 1
Multivariate psrf
1
mu[1,1] mu[2,1] mu[1,2] mu[2,2] mu[1,3]
Lag 0 1.000000000 1.000000000 1.0000000000 1.0000000000 1.000000000
Lag 1 -0.009952869 0.001711837 -0.0112878120 -0.0002450259 0.013291517
Lag 5 0.001364161 -0.007936163 0.0009364354 -0.0029507451 -0.004015313
Lag 10 0.009133840 0.015634833 0.0069595375 -0.0006803493 -0.005311946
Lag 50 -0.002210705 0.005235024 0.0013131967 -0.0067432771 -0.011722877
mu[2,3] sig
Lag 0 1.000000000 1.000000000
Lag 1 0.006225100 0.098312873
Lag 5 0.017230531 0.007220832
Lag 10 -0.009841003 0.009865716
Lag 50 -0.012646510 0.001219488
mu[1,1] mu[2,1] mu[1,2] mu[2,2] mu[1,3] mu[2,3] sig
15240.87 15214.31 15000.00 15106.12 14800.92 15000.00 12464.66
[[1]]
Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95
Burn-in Total Lower bound Dependence
(M) (N) (Nmin) factor (I)
mu[1,1] 2 3620 3746 0.966
mu[2,1] 2 3803 3746 1.020
mu[1,2] 2 3930 3746 1.050
mu[2,2] 2 3741 3746 0.999
mu[1,3] 2 3680 3746 0.982
mu[2,3] 2 3741 3746 0.999
sig 2 3930 3746 1.050
[[2]]
Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95
Burn-in Total Lower bound Dependence
(M) (N) (Nmin) factor (I)
mu[1,1] 2 3680 3746 0.982
mu[2,1] 2 3620 3746 0.966
mu[1,2] 2 3680 3746 0.982
mu[2,2] 2 3680 3746 0.982
mu[1,3] 2 3930 3746 1.050
mu[2,3] 2 3680 3746 0.982
sig 2 3680 3746 0.982
[[3]]
Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95
Burn-in Total Lower bound Dependence
(M) (N) (Nmin) factor (I)
mu[1,1] 2 3866 3746 1.030
mu[2,1] 2 3803 3746 1.020
mu[1,2] 2 3741 3746 0.999
mu[2,2] 2 3680 3746 0.982
mu[1,3] 2 3803 3746 1.020
mu[2,3] 2 3803 3746 1.020
sig 2 3741 3746 0.999
Let’s compute the DIC and compare with our previous models.
Mean deviance: 51.98
penalty 7.179
Penalized deviance: 59.16
Mean deviance: 55.38
penalty 5.134
Penalized deviance: 60.51
Mean deviance: 66.74
penalty 4.146
Penalized deviance: 70.89
This suggests that the full model with interaction between wool and tension is the best for explaining/predicting warp breaks.
Iterations = 1001:6000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu[1,1] 3.7164 0.14861 0.0012134 0.0012042
mu[2,1] 3.2852 0.14916 0.0012179 0.0012110
mu[1,2] 3.1156 0.14910 0.0012174 0.0012175
mu[2,2] 3.3088 0.14786 0.0012073 0.0012031
mu[1,3] 3.1189 0.15001 0.0012248 0.0012332
mu[2,3] 2.9034 0.14767 0.0012057 0.0012057
sig 0.4428 0.04484 0.0003661 0.0004018
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu[1,1] 3.4244 3.6201 3.7161 3.8154 4.011
mu[2,1] 2.9888 3.1867 3.2851 3.3833 3.581
mu[1,2] 2.8169 3.0174 3.1166 3.2144 3.408
mu[2,2] 3.0215 3.2107 3.3072 3.4081 3.603
mu[1,3] 2.8279 3.0190 3.1180 3.2179 3.420
mu[2,3] 2.6113 2.8057 2.9041 3.0015 3.196
sig 0.3656 0.4114 0.4391 0.4702 0.541
lower upper
mu[1,1] 3.4208976 4.0067858
mu[2,1] 3.0005080 3.5894101
mu[1,2] 2.8091998 3.3983164
mu[2,2] 3.0178575 3.5972163
mu[1,3] 2.8145120 3.4053630
mu[2,3] 2.6197778 3.2033014
sig 0.3615852 0.5351374
attr(,"Probability")
[1] 0.95
It might be tempting to look at comparisons between each combination of treatments, but we warn that this could yield spurious results. When we discussed the statistical modeling cycle, we said it is best not to search your results for interesting hypotheses, because if there are many hypotheses, some will appear to show “effects” or “associations” simply due to chance. Results are most reliable when we determine a relatively small number of hypotheses we are interested in beforehand, collect the data, and statistically evaluate the evidence for them.
One question we might be interested in with these data is finding the treatment combination that produces the fewest breaks. To calculate this, we can go through our posterior samples and for each sample, find out which group has the smallest mean. These counts help us determine the posterior probability that each of the treatment groups has the smallest mean.
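A sketch of that computation, using synthetic draws centered at the posterior means reported above (in practice, `post_mu` would hold the actual combined samples, one column per treatment group in the order mu[1,1], mu[2,1], mu[1,2], mu[2,2], mu[1,3], mu[2,3]):

```r
set.seed(7)
n_draws <- 15000
means <- c(3.7164, 3.2852, 3.1156, 3.3088, 3.1189, 2.9034)
# Synthetic stand-in for the posterior draws of the six cell means
post_mu <- sapply(means, function(m) rnorm(n_draws, mean = m, sd = 0.149))

# For each posterior draw, record which treatment group has the smallest mean
smallest <- apply(post_mu, 1, which.min)

# Posterior probability that each group produces the fewest (log) breaks
table(factor(smallest, levels = 1:6)) / n_draws
```

The resulting table of proportions is what is displayed below.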
2 3 4 5 6
0.0154000 0.1174000 0.0106000 0.1160667 0.7405333
The evidence supports wool B with high tension as the treatment that produces the fewest breaks.
As an example of linear regression, we’ll look at the Leinhardt data from the car package in R.
income infant region oil
Australia 3426 26.7 Asia no
Austria 3350 23.7 Europe no
Belgium 3346 17.0 Europe no
Canada 4751 16.8 Americas no
Denmark 5029 13.5 Europe no
Finland 3312 10.1 Europe no
So the Leinhardt dataset has 105 observations and 4 variables: income, infant, region, oil.
'data.frame': 105 obs. of 4 variables:
$ income: int 3426 3350 3346 4751 5029 3312 3403 5040 2009 2298 ...
$ infant: num 26.7 23.7 17 16.8 13.5 10.1 12.9 20.4 17.8 25.7 ...
$ region: Factor w/ 4 levels "Africa","Americas",..: 3 4 4 2 4 4 4 4 4 4 ...
$ oil : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
The relationships among the variables can be visualized with a paired scatter plot.
We’ll start with a simple linear regression model that relates infant mortality to per capita income. First, let’s check whether there is a linear relationship between the two variables, which we can do with a scatter plot.
As we can see, this setting is not suitable for a linear regression model: both variables are heavily right-skewed, and the scatter plot shows no linear trend. In cases like this, a log transformation can often help:
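In R, the transformation is one line per variable. For illustration, the sketch below applies it to the first rows shown earlier; in practice you would transform the full Leinhardt columns (e.g., Leinhardt$loginfant <- log(Leinhardt$infant)):

```r
# First rows of the Leinhardt data, as a stand-in for the full data frame
dat <- data.frame(income = c(3426, 3350, 3346),
                  infant = c(26.7, 23.7, 17.0))

dat$loginfant <- log(dat$infant)  # log-transformed response
dat$logincome <- log(dat$income)  # log-transformed predictor
dat
```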
A linear model appears much more appropriate on this (log) scale. The first model we fit is the frequentist linear model, which corresponds to a Bayesian analysis with a noninformative prior.
Call:
lm(formula = loginfant ~ logincome, data = Leinhardt)
Residuals:
Min 1Q Median 3Q Max
-1.66694 -0.42779 -0.02649 0.30441 3.08415
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.14582 0.31654 22.575 <2e-16 ***
logincome -0.51179 0.05122 -9.992 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6867 on 99 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.5021, Adjusted R-squared: 0.4971
F-statistic: 99.84 on 1 and 99 DF, p-value: < 2.2e-16
Let’s also check the model fit.
Now let’s fit the same model using JAGS. We also omit the countries with missing values to simplify the calculations.
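The JAGS version of the regression might be sketched as follows; the data names (`y` for log-infant mortality, `log_income` for the predictor) and the prior values are illustrative assumptions, and `dnorm` again takes a precision rather than a variance:

```
model {
    for (i in 1:length(y)) {
        y[i] ~ dnorm(mu[i], prec)
        mu[i] <- b[1] + b[2]*log_income[i]
    }
    for (j in 1:2) {
        b[j] ~ dnorm(0.0, 1.0/1.0e6)  # vague priors on intercept and slope
    }
    prec ~ dgamma(5.0/2.0, 5.0*10.0/2.0)
    sig <- sqrt(1.0/prec)
}
```

The three unknowns here (the two coefficients and the precision) account for the three unobserved stochastic nodes reported below.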
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 101
Unobserved stochastic nodes: 3
Total graph size: 411
Initializing model
Before we check the inferences from the model, we should perform convergence diagnostics for our Markov chains.
Potential scale reduction factors:
Point est. Upper C.I.
b[1] 1.01 1.02
b[2] 1.01 1.02
sig 1.00 1.00
Multivariate psrf
1
b[1] b[2] sig
Lag 0 1.0000000 1.00000000 1.000000000
Lag 1 0.9566226 0.95705969 0.017429397
Lag 5 0.7992579 0.79868771 -0.003117246
Lag 10 0.6313814 0.63098299 0.005512649
Lag 50 0.0973288 0.09687318 -0.008419093
b[1] b[2] sig
332.6212 334.7034 14416.0966
Iterations = 1001:6000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
b[1] 7.1717 0.46923 0.0038312 0.0262549
b[2] -0.5160 0.07587 0.0006194 0.0042264
sig 0.9717 0.06852 0.0005594 0.0005717
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
b[1] 6.2864 6.8586 7.1640 7.4739 8.1367
b[2] -0.6730 -0.5649 -0.5141 -0.4649 -0.3726
sig 0.8501 0.9237 0.9672 1.0143 1.1197
Checking residuals (the difference between the response and the model’s prediction for that value) is important with linear models since residuals can reveal violations of the assumptions we made to specify the model. In particular, we are looking for any sign that the model is not linear, normally distributed, or that the observations are not independent (conditional on covariates).
First, let’s look at what would have happened if we fit the reference linear model to the un-transformed variables.
Now let’s return to our model fit to the log-transformed variables. In a Bayesian model, we have distributions for residuals, but we’ll simplify and look only at the residuals evaluated at the posterior mean of the parameters.
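A sketch of that computation, echoing the first rows of the design matrix and the posterior means shown in the output below (in practice, `X`, `y`, and `pm_params` would come from the full data and fitted model):

```r
# Intercept column plus the first six log-income values
X <- cbind(1, c(8.139149, 8.116716, 8.115521, 8.466110, 8.522976, 8.105308))
y <- log(c(26.7, 23.7, 17.0, 16.8, 13.5, 10.1))  # log-infant mortality
pm_params <- c("b[1]" = 7.1716592, "b[2]" = -0.5160421, "sig" = 0.9717217)

yhat <- drop(X %*% pm_params[1:2])  # predictions at the posterior mean
resid <- y - yhat                   # residuals

plot(yhat, resid)  # look for patterns; ideally there are none
```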
[,1] [,2]
[1,] 1 8.139149
[2,] 1 8.116716
[3,] 1 8.115521
[4,] 1 8.466110
[5,] 1 8.522976
[6,] 1 8.105308
b[1] b[2] sig
7.1716592 -0.5160421 0.9717217
[1] "Saudi.Arabia" "Libya" "Zambia" "Brazil"
[5] "Afganistan"
The residuals look pretty good here (no patterns or shapes) except for two strong outliers, Saudi Arabia and Libya. When outliers appear, it is a good idea to double check that they are not just errors in data entry. If the values are correct, you may reconsider whether these data points really are representative of the data you are trying to model. If you conclude that they are not (for example, they were recorded in different years), you may be able to justify dropping these data points from the data set.
If you conclude that the outliers are part of the data and should not be removed, we have several modeling options to accommodate them. We will address these in the next segment.
The first approach is to look for additional covariates that may be able to explain the outliers. For example, there could be a number of variables that provide information about infant mortality above and beyond what income provides.
Looking back at our data, there are two variables we haven’t used yet: region and oil. The oil variable indicates oil-exporting countries. Both Saudi Arabia and Libya are oil-exporting countries, so perhaps this might explain part of the anomaly.
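We can encode oil as a 0/1 indicator to use as a covariate. A sketch with a stand-in factor (in practice this would be the oil column of the Leinhardt data after dropping the rows with missing values; the name `is_oil` is our own):

```r
# Stand-in for the oil factor; in practice use Leinhardt$oil
oil <- factor(c("no", "no", "yes", "no"), levels = c("no", "yes"))

is_oil <- as.numeric(oil == "yes") # 1 for oil-exporting countries, 0 otherwise
is_oil
```

Applied to the full data, this yields the 0/1 vector shown below.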
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
[36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 101
Unobserved stochastic nodes: 4
Total graph size: 517
Initializing model
As usual, check the convergence diagnostics.
Potential scale reduction factors:
Point est. Upper C.I.
b[1] 1.02 1.07
b[2] 1.02 1.07
b[3] 1.00 1.00
sig 1.00 1.00
Multivariate psrf
1.02
b[1] b[2] b[3] sig
Lag 0 1.00000000 1.00000000 1.000000000 1.000000000
Lag 1 0.95293549 0.95405069 0.067729189 0.015458865
Lag 5 0.78659991 0.78834984 0.004564552 0.008746490
Lag 10 0.61380756 0.61546721 0.004556215 0.021915678
Lag 50 0.02065725 0.01846522 -0.011600172 -0.002391477
b[1] b[2] b[3] sig
367.3007 357.6941 12826.9750 13721.0897
Iterations = 1001:6000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
b[1] 7.1590 0.44509 0.0036341 0.0231609
b[2] -0.5240 0.07205 0.0005883 0.0037946
b[3] 0.7824 0.35072 0.0028636 0.0030977
sig 0.9529 0.06721 0.0005487 0.0005768
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
b[1] 6.3063 6.8595 7.1504 7.4505 8.072
b[2] -0.6717 -0.5709 -0.5227 -0.4757 -0.386
b[3] 0.0830 0.5491 0.7851 1.0159 1.479
sig 0.8315 0.9062 0.9498 0.9957 1.095
It looks like there is a positive relationship between oil-production and log-infant mortality. Because these data are merely observational, we cannot say that oil-production causes an increase in infant mortality (indeed that most certainly isn’t the case), but we can say that they are positively correlated.
Now let’s check the residuals.
[,1] [,2] [,3]
[1,] 1 8.139149 0
[2,] 1 8.116716 0
[3,] 1 8.115521 0
[4,] 1 8.466110 0
[5,] 1 8.522976 0
[6,] 1 8.105308 0
b[1] b[2] b[3] sig
7.1590294 -0.5240216 0.7824174 0.9529251
[1] 0.6488851
These look much better, although the residuals for Saudi Arabia and Libya are still more than three standard deviations away from the mean of the residuals. We might consider adding the other covariate region, but instead let’s look at another option when we are faced with strong outliers.
t likelihood
Let’s consider changing the likelihood. The normal likelihood has thin tails (almost all of the probability is concentrated within the first few standard deviations from the mean), which does not accommodate outliers well. Consequently, models with the normal likelihood might be overly influenced by outliers. Recall that the \(t\) distribution is similar to the normal distribution but has thicker tails, which can accommodate outliers.
The \(t\) linear model might look something like this. Notice that the \(t\) distribution has three parameters, including a positive “degrees of freedom” parameter. The smaller the degrees of freedom, the heavier the tails of the distribution. We might fix the degrees of freedom to some number, or we can assign it a prior distribution.
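A sketch of such a model in the JAGS language, keeping the oil indicator from the previous model (variable names and prior values are illustrative assumptions); in JAGS, `dt` takes a location, an inverse squared scale, and the degrees of freedom:

```
model {
    for (i in 1:length(y)) {
        y[i] ~ dt(mu[i], tau, df)
        mu[i] <- b[1] + b[2]*log_income[i] + b[3]*is_oil[i]
    }
    for (j in 1:3) {
        b[j] ~ dnorm(0.0, 1.0/1.0e6)
    }
    df <- nu + 2.0  # shift so the degrees of freedom exceed 2 and the variance exists
    nu ~ dexp(1.0)  # prior on the (shifted) degrees of freedom
    tau ~ dgamma(5.0/2.0, 5.0*10.0/2.0)  # inverse squared scale; related to, but not equal to, the precision
    sig <- sqrt(1.0/tau * df/(df - 2.0)) # standard deviation of the errors
}
```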
We have now proposed three different models. How do we compare their performance on our data? In the previous course, we discussed estimating parameters in models using the maximum likelihood method. Similarly, we can choose between competing models using the same idea.
We will use a quantity known as the deviance information criterion (DIC). It essentially calculates the posterior mean of the log-likelihood and adds a penalty for model complexity.
Let’s calculate the DIC for our first two models:
the simple linear regression on log-income,
Mean deviance: 231.4
penalty 2.837
Penalized deviance: 234.2
Mean deviance: 225.5
penalty 4.135
Penalized deviance: 229.6
The first number is the Monte Carlo estimated posterior mean deviance, which equals -2 times the log-likelihood (plus a constant that will be irrelevant for comparing models). Because of that -2 factor, a smaller deviance means a higher likelihood.
Next, we are given a penalty for the complexity of our model. This penalty is necessary because we can always increase the likelihood of the model by making it more complex so that it fits the data exactly. We don’t want to do this because over-fit models generalize poorly. This penalty is roughly equal to the effective number of parameters in the model, which you can see in the output above: the first model had a variance parameter and two betas, for a total of three parameters (penalty ≈ 2.8), and the second model added one more beta for the oil effect (penalty ≈ 4.1).
We add these two quantities to get the DIC (the last number). The better-fitting model has a lower DIC value. In this case, the gains we receive in deviance by adding the is_oil covariate outweigh the penalty for adding an extra parameter. The final DIC for the second model is lower than for the first, so we would prefer using the second model.
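As a quick check of the arithmetic, the penalized deviance reported above is simply the mean deviance plus the penalty:

```r
# DIC = Monte Carlo mean deviance + effective number of parameters (penalty)
dic1 <- 231.4 + 2.837  # simple linear regression on log-income
dic2 <- 225.5 + 4.135  # model with the is_oil covariate added

c(dic1, dic2)  # 234.237 and 229.635, matching the penalized deviances above
```

Since the second model's DIC is lower, it is the preferred model despite its extra parameter.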
We encourage you to explore different model specifications and compare their fit to the data using DIC. Wikipedia provides a good introduction to DIC, and you can find more details about the JAGS implementation in the rjags package documentation by entering ?dic.samples in the R console.
New York City Crime Data
As an example of a Bayesian linear regression model, we look at New York City crime data from 1966 to 1967. The outcome variable (THEFT) is the increase or decrease in the seasonally adjusted rate of grand larcenies in 23 Manhattan police precincts from a 27-week pre-intervention period compared to a 58-week intervention period. The predictor variables are the percent increase or decrease in the number of police officers in a precinct (MAN), and whether the precinct was located uptown, midtown or downtown.
We specify the model as:
\[THEFT_i \sim N(\mu_i, \sigma^2) \\ \mu_i = \beta_0 + \beta_1 \cdot MAN_i + district_i \\ \frac{1}{\sigma^2} \sim \Gamma(0.001, 0.001) \\ \beta_0 \sim N(0, 100000) \\ \beta_1 \sim N(0, 100000)\]
We will need a prior for the effect of geographic area (district), but will discuss that in a moment. First, we consider how best to code an indicator variable in BUGS. There are two possible approaches. We can create a “design matrix” of dummy variables: in this approach we create two variables named DIST1 and DIST2, for which downtown precincts are coded 0,0, midtown precincts are coded 1,0, and uptown precincts are coded 0,1. The BUGS code for this model would then be:
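A sketch of that model in BUGS/JAGS syntax, using the dummy-variable coding and priors matching the specification above (variable names are illustrative, and recall that BUGS parameterizes the normal by its precision):

```
model {
    for (i in 1:n) {
        THEFT[i] ~ dnorm(mu[i], prec)
        mu[i] = b0 + b1*MAN[i] + b2*DIST1[i] + b3*DIST2[i]
    }
    b0 ~ dnorm(0.0, 1.0/100000)  # precision = 1/variance
    b1 ~ dnorm(0.0, 1.0/100000)
    b2 ~ dnorm(0.0, 1.0/100000)
    b3 ~ dnorm(0.0, 1.0/100000)
    prec ~ dgamma(0.001, 0.001)
}
```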
For an example of logistic regression, we’ll use the urine data set from the boot package in R. The response variable is r, which takes on values of 0 or 1. We will remove some rows from the data set which contain missing values.
r gravity ph osmo cond urea calc
1 0 1.021 4.91 725 NA 443 2.45
2 0 1.017 5.74 577 20.0 296 4.49
3 0 1.008 7.20 321 14.9 101 2.36
4 0 1.011 5.51 408 12.6 224 2.15
5 0 1.005 6.52 187 7.5 91 1.16
6 0 1.020 5.27 668 25.3 252 3.34
Let’s look at pairwise scatter plots of the seven variables.
One thing that stands out is that several of these variables are strongly correlated with one another. For example, gravity and osmo appear to have a very close linear relationship. Collinearity between x variables in linear regression models can cause trouble for statistical inference. Two correlated variables will compete for the ability to predict the response variable, leading to unstable estimates. This is not a problem for prediction of the response, if prediction is the end goal of the model. But if our objective is to discover how the variables relate to the response, we should avoid collinearity.
We can more formally estimate the correlation among these variables using the corrplot package.
One primary goal of this analysis is to find out which variables are related to the presence of calcium oxalate crystals. This objective is often called “variable selection.” We have already seen one way to do this: fit several models that include different sets of variables and see which one has the best DIC. Another way to do this is to use a linear model where the priors for the \(\beta\) coefficients favor values near 0 (indicating a weak relationship). This way, the burden of establishing association lies with the data. If there is not a strong signal, we assume it doesn’t exist.
Rather than tailoring a prior for each individual \(\beta\) based on the scale of its covariate, it is customary to center and scale each variable by subtracting its mean and dividing by its standard deviation.
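In R this is one call to scale(); a minimal sketch on made-up data (the column names here are just for illustration):

```r
set.seed(42)
# hypothetical design matrix with two covariates on very different scales
X <- cbind(gravity = rnorm(20, 1.02, 0.01), calc = rnorm(20, 4, 3))

Xc <- scale(X, center = TRUE, scale = TRUE)  # subtract mean, divide by sd

colMeans(Xc)      # essentially 0 for every column
apply(Xc, 2, sd)  # exactly 1 for every column
```

This matches the output below, where the column means are numerically zero and the standard deviations are all 1.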
2 3 4 5 6 7
-0.1403037 -1.3710690 -0.9608139 -1.7813240 0.2699514 -0.8240622
gravity ph osmo cond urea
-9.861143e-15 8.511409e-17 1.515743e-16 -1.829852e-16 7.335402e-17
calc
-1.689666e-18
gravity ph osmo cond urea calc
1 1 1 1 1 1
Our prior for the \(\beta\) coefficients (which we’ll call b in the model) will be the double exponential (or Laplace) distribution, which, as the name implies, is the exponential distribution with tails extending in the positive direction as well as the negative direction, with a sharp peak at 0. We can read more about it in the JAGS manual.
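A sketch of the logistic regression model with Laplace priors in JAGS (ddexp is JAGS’s double exponential; the intercept prior and the rate sqrt(2), which gives the prior variance 1, are illustrative choices):

```
model {
    for (i in 1:n) {
        y[i] ~ dbern(p[i])
        logit(p[i]) = int + b[1]*gravity[i] + b[2]*ph[i] + b[3]*osmo[i] +
                      b[4]*cond[i] + b[5]*urea[i] + b[6]*calc[i]
    }
    int ~ dnorm(0.0, 1.0/25.0)
    for (j in 1:6) {
        b[j] ~ ddexp(0.0, sqrt(2.0))  # double exponential with variance 1.0
    }
}
```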
gravity ph osmo cond urea calc
2 -0.1403037 -0.4163725 -0.1528785 -0.1130908 0.25747827 0.09997564
3 -1.3710690 1.6055972 -1.2218894 -0.7502609 -1.23693077 -0.54608444
4 -0.9608139 -0.7349020 -0.8585927 -1.0376121 -0.29430353 -0.60978050
5 -1.7813240 0.6638579 -1.7814497 -1.6747822 -1.31356713 -0.91006194
6 0.2699514 -1.0672806 0.2271214 0.5490664 -0.07972172 -0.24883614
7 -0.8240622 -0.5825618 -0.6372741 -0.4379226 -0.51654898 -0.83726644
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 77
Unobserved stochastic nodes: 7
Total graph size: 1096
Initializing model
Potential scale reduction factors:
Point est. Upper C.I.
b[1] 1 1.00
b[2] 1 1.00
b[3] 1 1.01
b[4] 1 1.01
b[5] 1 1.00
b[6] 1 1.00
int 1 1.00
Multivariate psrf
1
b[1] b[2] b[3] b[4] b[5]
Lag 0 1.00000000 1.0000000000 1.00000000 1.000000000 1.000000000
Lag 1 0.83782954 0.2987078057 0.89775903 0.763252490 0.805402070
Lag 5 0.43071069 -0.0146172396 0.58029004 0.348910306 0.392830115
Lag 10 0.20108565 0.0007040613 0.33031998 0.183319130 0.174396225
Lag 50 -0.00152204 0.0121613179 -0.02271751 -0.007778267 -0.006338512
b[6] int
Lag 0 1.00000000 1.000000000
Lag 1 0.48781767 0.290085186
Lag 5 0.05229748 0.023234647
Lag 10 0.01188647 0.022378266
Lag 50 -0.01617526 -0.007485411
b[1] b[2] b[3] b[4] b[5] b[6] int
1323.5634 8200.7273 808.1154 1451.3082 1424.3403 4959.7263 7338.8099
Let’s look at the results.
Iterations = 2001:7000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
b[1] 1.6751 0.7726 0.006308 0.021266
b[2] -0.1415 0.2860 0.002335 0.003162
b[3] -0.2821 0.8290 0.006769 0.029195
b[4] -0.7649 0.5204 0.004249 0.013801
b[5] -0.6308 0.6090 0.004972 0.016122
b[6] 1.6153 0.4956 0.004046 0.007048
int -0.1812 0.3069 0.002506 0.003591
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
b[1] 0.3449 1.1314 1.6010 2.14544 3.3454
b[2] -0.7333 -0.3247 -0.1250 0.04586 0.3996
b[3] -2.1163 -0.7330 -0.1984 0.16667 1.3038
b[4] -1.8606 -1.0966 -0.7443 -0.39796 0.1629
b[5] -1.9975 -1.0051 -0.5667 -0.17784 0.3808
b[6] 0.7305 1.2720 1.5880 1.92534 2.6926
int -0.7723 -0.3896 -0.1875 0.02098 0.4270
[1] "gravity" "ph" "osmo" "cond" "urea" "calc"
It is clear that the coefficients for variables gravity, cond (conductivity), and calc (calcium concentration) are not 0. The posterior distribution for the coefficient of osmo (osmolarity) looks like the prior and is still almost centered on 0, so we’ll conclude that osmo is not a strong predictor of calcium oxalate crystals. The same goes for ph.
urea (urea concentration) appears to be a borderline case. However, if we refer back to our correlations among the variables, we see that urea is highly correlated with gravity, so we opt to remove it.
Our second model looks like this:
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 77
Unobserved stochastic nodes: 4
Total graph size: 644
Initializing model
Potential scale reduction factors:
Point est. Upper C.I.
b[1] 1 1.00
b[2] 1 1.01
b[3] 1 1.00
int 1 1.00
Multivariate psrf
1
b[1] b[2] b[3] int
Lag 0 1.00000000 1.000000000 1.000000000 1.000000000
Lag 1 0.58793996 0.659269202 0.518618026 0.268846239
Lag 5 0.11264262 0.172292876 0.052563130 0.011156841
Lag 10 0.01394350 0.028131378 0.027109868 0.008667875
Lag 50 0.02214263 -0.004663427 -0.009634286 0.004534375
b[1] b[2] b[3] int
3454.705 2532.593 4482.459 8152.918
Mean deviance: 68.64
penalty 5.499
Penalized deviance: 74.14
Mean deviance: 71.14
penalty 3.999
Penalized deviance: 75.14
Iterations = 2001:7000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
b[1] 1.4203 0.5045 0.004119 0.008616
b[2] -1.3510 0.4539 0.003706 0.009065
b[3] 1.8782 0.5512 0.004500 0.008251
int -0.1526 0.3216 0.002626 0.003563
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
b[1] 0.4903 1.0675 1.3948 1.74094 2.4670
b[2] -2.2732 -1.6524 -1.3403 -1.03696 -0.4860
b[3] 0.9040 1.4874 1.8533 2.22382 3.0601
int -0.7868 -0.3641 -0.1543 0.05851 0.4914
lower upper
b[1] 0.4901620 2.4667547
b[2] -2.2363226 -0.4572467
b[3] 0.8602976 3.0022585
int -0.7868230 0.4915983
attr(,"Probability")
[1] 0.95
[1] "gravity" "cond" "calc"
The DIC is actually better for the first model. Note that we did change the prior between models, and generally we should not use the DIC to choose between priors. Hence comparing DIC between these two models may not be a fair comparison. Nevertheless, they both yield essentially the same conclusions. Higher values of gravity and calc (calcium concentration) are associated with higher probabilities of calcium oxalate crystals, while higher values of cond (conductivity) are associated with lower probabilities of calcium oxalate crystals. There are more modeling options in this scenario, perhaps including transformations of variables, different priors, and interactions between the predictors, but we’ll leave it to you to see if you can improve the model.
How do we turn model parameter estimates into model predictions? The key is the form of the model. Remember that the likelihood is Bernoulli, which is 1 with probability p. We modeled the logit of p as a linear model, which we showed in the first segment of this lesson leads to an exponential form for \(E(y) = p\).
Take the output from our model in the last segment. We will use the posterior means as point estimates of the parameters.
b[1] b[2] b[3] int
1.4202939 -1.3510461 1.8782490 -0.1526008
The posterior mean of the intercept was about -0.15. Since we centered and scaled all of the covariates, values of 0 for each x correspond to the average values. Therefore, if we use our last model, our point estimate for the probability of calcium oxalate crystals when gravity, cond, and calc are at their average values is \(1/(1+e^{-(-0.15)}) \approx 0.4625702\).
Now suppose we want to make a prediction for a new specimen whose value of gravity is average, whose value of cond is one standard deviation below the mean, and whose value of calc is one standard deviation above the mean. Our point estimate for the probability of calcium oxalate crystals is \(1/(1+e^{-(-0.15 + 1.4 \cdot 0.0 - 1.3 \cdot (-1.0) + 1.9 \cdot 1.0)}) \approx 0.9547825\).
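Both point estimates above can be reproduced with a few lines of base R, using the rounded posterior means:

```r
inv_logit <- function(x) 1 / (1 + exp(-x))  # inverse of the logit link

# rounded posterior means: intercept, gravity, cond, calc
int <- -0.15; b_grav <- 1.4; b_cond <- -1.3; b_calc <- 1.9

# all three covariates at their (centered and scaled) average of 0
p_avg <- inv_logit(int + b_grav*0 + b_cond*0 + b_calc*0)

# gravity average, cond one sd below the mean, calc one sd above
p_new <- inv_logit(int + b_grav*0 + b_cond*(-1) + b_calc*1)

c(p_avg, p_new)  # approximately 0.463 and 0.955
```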
If we want to make predictions in terms of the original x variable values, we have two options:
For each x variable, subtract the mean and divide by the standard deviation for that variable in the original data set used to fit the model.
Re-fit the model without centering and scaling the covariates.
We can use the same ideas to make predictions for each of the original data points. This is similar to what we did to calculate residuals with earlier models.
First we take the X matrix and matrix multiply it with the posterior means of the coefficients. Then we need to pass these linear values through the inverse of the link function as we did above.
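A sketch of that computation; the design matrix here is a hypothetical stand-in built from the first three rows of the scaled covariates shown earlier, and the coefficients are the posterior means from the summary above:

```r
# posterior means from the model summary above
coefs <- c(int = -0.1526, gravity = 1.4203, cond = -1.3510, calc = 1.8782)

# intercept column plus gravity, cond, calc for three specimens
# (values taken from the centered/scaled data shown earlier)
X <- cbind(1, matrix(c(-0.14, -0.11,  0.10,
                       -1.37, -0.75, -0.55,
                       -0.96, -1.04, -0.61), nrow = 3, byrow = TRUE))

eta  <- X %*% coefs          # linear predictor for each observation
phat <- 1 / (1 + exp(-eta))  # inverse link: predicted probabilities
phat
```

These values land close to the first three phat entries reported below.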
[,1]
2 0.49717423
3 0.10793911
4 0.22085399
5 0.10628912
6 0.27321322
7 0.09079615
These phat values are the model’s predicted probability of calcium oxalate crystals for each data point. We can get a rough idea of how successful the model is by plotting these predicted values against the actual outcome.
Suppose we choose a cutoff for these predicted probabilities. If the model tells us the probability is higher than 0.5, we will classify the observation as a 1 and if it is less than 0.5, we will classify it as a 0. That way the model classifies each data point. Now we can tabulate these classifications against the truth to see how well the model predicts the original data.
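With made-up predicted probabilities and outcomes, the classification table and the correct classification rate take one line each in R:

```r
# hypothetical predicted probabilities and true outcomes
phat <- c(0.81, 0.43, 0.12, 0.67, 0.55, 0.09)
y    <- c(1,    0,    0,    1,    0,    0)

tab <- table(phat > 0.5, y)       # rows: classification, columns: truth
acc <- sum(diag(tab)) / sum(tab)  # correct classification rate
acc
```

The diagonal of the table counts the correctly classified points, as in the table shown below for the real data.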
0 1
FALSE 38 12
TRUE 6 21
[1] 0.7662338
The correct classification rate is about 76%, not too bad, but not great.
Now suppose that it is considered really bad to predict no calcium oxalate crystal when there in fact is one. We might then choose to lower our threshold for classifying data points as 1s. Say we change it to 0.3. That is, if the model says the probability is greater than 0.3, we will classify it as having a calcium oxalate crystal.
0 1
FALSE 32 7
TRUE 12 26
[1] 0.7532468
It looks like we gave up a little classification accuracy, but we did indeed increase our chances of detecting a true positive.
We could repeat this exercise for many thresholds between 0 and 1, and each time calculate our error rates. This is equivalent to calculating what is called the ROC (receiver-operating characteristic) curve, which is often used to evaluate classification techniques.
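A minimal sketch of that sweep over thresholds, again on made-up values, computing the two error rates that trace out the ROC curve:

```r
# hypothetical predicted probabilities and true outcomes
phat <- c(0.81, 0.43, 0.12, 0.67, 0.55, 0.09)
y    <- c(1,    0,    0,    1,    0,    0)

thresholds <- seq(0, 1, by = 0.1)
tpr <- sapply(thresholds, function(t) mean(phat[y == 1] > t))  # sensitivity
fpr <- sapply(thresholds, function(t) mean(phat[y == 0] > t))  # 1 - specificity
cbind(thresholds, fpr, tpr)  # points along the ROC curve
```

Plotting tpr against fpr gives the ROC curve; a curve hugging the top-left corner indicates a good classifier.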
These classification tables we have calculated were all in-sample. They were predicting for the same data used to fit the model. We could get a less biased assessment of how well our model performs if we calculated these tables for data that were not used to fit the model. For example, before fitting the model, you could withhold a set of randomly selected “test” data points, and use the model fit to the rest of the “training” data to make predictions on your “test” set.
Data
For an example of Poisson regression, we’ll use the badhealth data set from the COUNT package in R.
numvisit badh age
1 30 0 58
2 20 0 54
3 16 0 44
4 20 0 57
5 15 0 33
6 15 0 28
[1] FALSE
As usual, let’s visualize these data.
It appears that both age and bad health are related to the number of doctor visits. We should include model terms for both variables. If we believe the age/visits relationship is different between healthy and non-healthy populations, we should also include an interaction term. We will fit the full model here and leave it to you to compare it with the simpler additive model.
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 1127
Unobserved stochastic nodes: 4
Total graph size: 3673
Initializing model
Potential scale reduction factors:
Point est. Upper C.I.
b_age 1.05 1.17
b_badh 1.07 1.21
b_intx 1.07 1.23
int 1.05 1.16
Multivariate psrf
1.07
b_age b_badh b_intx int
Lag 0 1.0000000 1.0000000 1.0000000 1.0000000
Lag 1 0.9568339 0.9648059 0.9670415 0.9529518
Lag 5 0.8289827 0.8669756 0.8727088 0.8248339
Lag 10 0.7026633 0.7648550 0.7714431 0.6972311
Lag 50 0.2181864 0.2220852 0.2361552 0.2120618
b_age b_badh b_intx int
262.2733 198.5700 192.1092 267.0209
To get a general idea of the model’s performance, we can look at predicted values and residuals as usual. Don’t forget that we must apply the inverse of the link function to get predictions for \(\lambda\).
badh age
[1,] 0 58 0
[2,] 0 54 0
[3,] 0 44 0
[4,] 0 57 0
[5,] 0 33 0
[6,] 0 28 0
b_age b_badh b_intx int
0.008308285 1.553631324 -0.010526445 0.354702836
It is not surprising that the variability increases with the predicted values, since in the Poisson distribution the variance equals the mean. However, observations predicted to have about two visits should have variance about two, and observations predicted to have about six visits should have variance about six.
[1] 7.022633
[1] 41.1963
Clearly this is not the case with these data. This indicates that either the model fits poorly (meaning the covariates don’t explain enough of the variability in the data), or the data are “overdispersed” for the Poisson likelihood we have chosen. This is a common issue with count data. If the data are more variable than the Poisson likelihood would suggest, a good alternative is the negative binomial distribution, which we will not pursue here.
Assuming the model fit is adequate, we can interpret the results.
Iterations = 2001:7000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
b_age 0.008283 0.002048 1.672e-05 0.0001312
b_badh 1.551187 0.186064 1.519e-03 0.0133989
b_intx -0.010526 0.004297 3.508e-05 0.0003122
int 0.355250 0.080018 6.533e-04 0.0051004
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
b_age 0.004221 0.006972 0.008308 0.009644 0.012218
b_badh 1.177831 1.432008 1.553631 1.674402 1.912097
b_intx -0.018892 -0.013424 -0.010526 -0.007719 -0.001754
int 0.202702 0.302248 0.354703 0.406303 0.517165
The intercept is not necessarily interpretable here because it corresponds to a healthy 0-year-old, whereas the youngest person in the data set is 20 years old.
For healthy individuals, it appears that age has a positive association with number of doctor visits. Clearly, bad health is associated with an increase in expected number of visits. The interaction coefficient is interpreted as an adjustment to the age coefficient for people in bad health. Hence, for people with bad health, age is essentially unassociated with number of visits.
Let’s say we have two people aged 35, one in good health and the other in poor health. What is the posterior probability that the individual with poor health will have more doctor visits? This goes beyond the posterior probabilities we have calculated comparing expected responses in previous lessons. Here we will create Monte Carlo samples for the responses themselves. This is done by taking the Monte Carlo samples of the model parameters, and for each of those, drawing a sample from the likelihood. Let’s walk through this.
First, we need the x values for each individual. We’ll say the healthy one is Person 1 and the unhealthy one is Person 2. Their x values are
The posterior samples of the model parameters are stored in mod_csim:
Markov Chain Monte Carlo (MCMC) output:
Start = 1
End = 7
Thinning interval = 1
b_age b_badh b_intx int
[1,] 0.01251371 1.522677 -0.010775965 0.1742109
[2,] 0.01227413 1.499695 -0.010834899 0.2338410
[3,] 0.01168103 1.490651 -0.008354329 0.2261534
[4,] 0.01139581 1.406510 -0.009096313 0.2555852
[5,] 0.01144687 1.395884 -0.008626509 0.2807728
[6,] 0.01085805 1.398194 -0.009371433 0.2690865
[7,] 0.01028249 1.455677 -0.010255476 0.3066875
First, we’ll compute the linear part of the predictor:
Next we’ll apply the inverse link:
The final step is to use these samples for the \(\lambda\) parameter for each individual and simulate actual number of doctor visits using the likelihood:
[1] 15000
Finally, we can answer the original question: What is the probability that the person with poor health will have more doctor visits than the person with good health?
[1] 0.9189333
Because we used our posterior samples for the model parameters in our simulation, this posterior predictive distribution on the number of visits for these two new individuals naturally takes into account our uncertainty in the model estimates. This is a more honest/realistic distribution than we would get if we had fixed the model parameters at their MLE or posterior means and simulated data for the new individuals.
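The whole walkthrough above can be sketched end-to-end. Here the posterior draws are simulated normal stand-ins with means and standard deviations matching the posterior summaries; in practice you would take the columns of mod_csim directly:

```r
set.seed(72)
n_sim <- 15000  # matches the number of posterior samples above

# stand-in posterior draws (in practice: columns of mod_csim)
b_int  <- rnorm(n_sim,  0.3553, 0.0800)
b_badh <- rnorm(n_sim,  1.5512, 0.1861)
b_age  <- rnorm(n_sim,  0.0083, 0.0020)
b_intx <- rnorm(n_sim, -0.0105, 0.0043)

# linear predictor for each 35-year-old: healthy (1) and unhealthy (2)
loglam1 <- b_int + b_age*35
loglam2 <- b_int + b_age*35 + b_badh + b_intx*35

# inverse link gives lambda; then draw visits from the Poisson likelihood
y1 <- rpois(n_sim, exp(loglam1))
y2 <- rpois(n_sim, exp(loglam2))

mean(y2 > y1)  # posterior predictive probability that Person 2 has more visits
```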
Let’s fit our hierarchical model for counts of chocolate chips. The data can be found in
| chips | location |
|---|---|
| 12 | 1 |
| 12 | 1 |
| 6 | 1 |
| 13 | 1 |
| 12 | 1 |
| 12 | 1 |
| 9 | 1 |
| 10 | 1 |
| 7 | 1 |
| 14 | 1 |
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 30 | 30 | 30 | 30 | 30 |
We can also visualize the distribution of chips by location.
Before implementing the model, we need to select prior distributions for \(\alpha\) and \(\beta\), the hyperparameters governing the gamma distribution for the \(\lambda\) parameters. First, think about what the \(\lambda\)’s represent. For location j, \(\lambda_j\) is the expected number of chocolate chips per cookie. Hence, \(\alpha\) and \(\beta\) control the distribution of these means between locations. The mean of this gamma distribution will represent the overall mean of number of chips for all cookies. The variance of this gamma distribution controls the variability between locations. If this is high, the mean number of chips will vary widely from location to location. If it is small, the mean number of chips will be nearly the same from location to location.
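In terms of the gamma hyperparameters, these two quantities are

\[ E(\lambda_j) = \frac{\alpha}{\beta}, \qquad Var(\lambda_j) = \frac{\alpha}{\beta^2}, \]

so a prior on \((\alpha, \beta)\) implies a prior on the between-location mean and variability.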
To see the effects of different priors on the distribution of \(\lambda\)’s, we can simulate. Suppose we try independent exponential priors for \(\alpha\) and \(\beta\).
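For example (the exponential rates here are illustrative guesses, not necessarily the ones used to produce the summaries below):

```r
set.seed(39)
n_sim <- 500

# illustrative exponential rates; tune these to reflect actual prior beliefs
alpha_pri <- rexp(n_sim, rate = 1/2)
beta_pri  <- rexp(n_sim, rate = 5)

# implied prior draws for the mean and sd of the lambda's across locations
mu_pri  <- alpha_pri / beta_pri
sig_pri <- sqrt(alpha_pri / beta_pri^2)

summary(mu_pri)
summary(sig_pri)
```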
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.021 2.983 9.852 61.127 29.980 4858.786
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1834 3.3663 8.5488 41.8137 22.2219 2865.6461
After simulating from the priors for \(\alpha\) and \(\beta\), we can use those samples to simulate further down the hierarchy:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.171 7.667 83.062 28.621 11005.331
Or for a prior predictive reconstruction of the original data set:
[1] 66.444084 9.946688 6.028319 15.922568 47.978587
[1] 63 58 64 63 70 62 61 48 71 73 70 77 66 60 72 77 69 62 66 71 49 80 66
[24] 75 74 55 62 90 65 57 12 9 7 10 12 10 11 7 14 13 9 6 6 13 7 10
[47] 12 9 9 10 7 8 6 9 7 10 13 13 8 12 6 10 3 6 7 4 6 7 5
[70] 5 4 3 6 2 8 4 8 4 5 7 1 4 5 3 8 8 3 1 7 3 16 14
[93] 13 17 17 12 13 13 16 16 15 14 11 10 13 17 16 19 16 17 15 16 7 17 21
[116] 16 12 15 14 13 52 44 51 46 39 40 40 44 46 59 45 49 58 42 31 52 43 47
[139] 53 41 48 57 35 60 51 58 36 34 41 59
Because these priors have high variance and are somewhat noninformative, they produce unrealistic predictive distributions. Still, enough data would overwhelm the prior, resulting in useful posterior distributions. Alternatively, we could tweak and simulate from these prior distributions until they adequately represent our prior beliefs. Yet another approach would be to re-parameterize the gamma prior, which we’ll demonstrate as we fit the model.
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 150
Unobserved stochastic nodes: 7
Total graph size: 322
Initializing model
Potential scale reduction factors:
Point est. Upper C.I.
lam[1] 1 1
lam[2] 1 1
lam[3] 1 1
lam[4] 1 1
lam[5] 1 1
mu 1 1
sig 1 1
Multivariate psrf
1
lam[1] lam[2] lam[3] lam[4] lam[5]
Lag 0 1.000000000 1.000000000 1.000000000 1.00000000 1.0000000000
Lag 1 0.030355236 0.114891517 0.012019002 0.02420960 0.0550134305
Lag 5 -0.011033347 -0.001825634 0.006302119 -0.00340224 -0.0003906281
Lag 10 0.014915445 0.002371482 0.005600358 -0.01068207 0.0023366539
Lag 50 -0.006870905 0.002557641 -0.017849438 -0.01373833 -0.0010541139
mu sig
Lag 0 1.000000000 1.000000000
Lag 1 0.377884708 0.573072968
Lag 5 0.020871448 0.089990578
Lag 10 -0.004499889 0.007465335
Lag 50 0.016514603 0.017038862
lam[1] lam[2] lam[3] lam[4] lam[5] mu sig
14097.507 11079.695 15000.000 14529.931 12590.876 6261.560 3782.793
After assessing convergence, we can check the fit via residuals. With a hierarchical model, there are now two levels of residuals: the observation level and the location mean level. To simplify, we’ll look at the residuals associated with the posterior means of the parameters.
First, we have observation residuals, based on the estimates of location means.
lam[1] lam[2] lam[3] lam[4] lam[5] mu sig
9.281197 6.226731 9.528052 8.949076 11.758216 9.125364 2.088021
[1] 6.447126
[1] 13.72414
We don’t see any obvious violations of our model assumptions.
Iterations = 2001:7000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
lam[1] 9.281 0.5298 0.004325 0.004467
lam[2] 6.227 0.4600 0.003756 0.004370
lam[3] 9.528 0.5445 0.004446 0.004446
lam[4] 8.949 0.5260 0.004295 0.004365
lam[5] 11.758 0.6150 0.005021 0.005504
mu 9.125 0.9896 0.008080 0.012812
sig 2.088 0.7089 0.005788 0.011699
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
lam[1] 8.273 8.918 9.274 9.633 10.357
lam[2] 5.363 5.907 6.212 6.534 7.166
lam[3] 8.491 9.157 9.514 9.888 10.622
lam[4] 7.944 8.591 8.937 9.296 10.003
lam[5] 10.582 11.339 11.746 12.164 12.996
mu 7.257 8.508 9.090 9.705 11.204
sig 1.102 1.590 1.953 2.444 3.816
We can extend the linear model for the Leinhardt data on infant mortality by incorporating the region variable. We’ll do this with a hierarchical model, where each region has its own intercept.
'data.frame': 105 obs. of 4 variables:
$ income: int 3426 3350 3346 4751 5029 3312 3403 5040 2009 2298 ...
$ infant: num 26.7 23.7 17 16.8 13.5 10.1 12.9 20.4 17.8 25.7 ...
$ region: Factor w/ 4 levels "Africa","Americas",..: 3 4 4 2 4 4 4 4 4 4 ...
$ oil : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
income infant region oil
Australia 3426 26.7 Asia no
Austria 3350 23.7 Europe no
Belgium 3346 17.0 Europe no
Canada 4751 16.8 Americas no
Denmark 5029 13.5 Europe no
Finland 3312 10.1 Europe no
Previously, we worked with infant mortality and income on the logarithmic scale. Recall also that we had to remove some missing data.
'data.frame': 101 obs. of 6 variables:
$ income : int 3426 3350 3346 4751 5029 3312 3403 5040 2009 2298 ...
$ infant : num 26.7 23.7 17 16.8 13.5 10.1 12.9 20.4 17.8 25.7 ...
$ region : Factor w/ 4 levels "Africa","Americas",..: 3 4 4 2 4 4 4 4 4 4 ...
$ oil : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ logincome: num 8.14 8.12 8.12 8.47 8.52 ...
$ loginfant: num 3.28 3.17 2.83 2.82 2.6 ...
- attr(*, "na.action")=Class 'omit' Named int [1:4] 24 83 86 91
.. ..- attr(*, "names")= chr [1:4] "Iran" "Haiti" "Laos" "Nepal"
Now we can fit the proposed model:
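A sketch of that hierarchical model in JAGS, with region-specific intercepts a[1:4] drawn from a common normal distribution (the node names match the monitored parameters in the output below, but the hyperprior choices are illustrative):

```
model {
    for (i in 1:length(y)) {
        y[i] ~ dnorm(mu[i], prec)
        mu[i] = a[region[i]] + b[1]*log_income[i] + b[2]*is_oil[i]
    }
    for (j in 1:max(region)) {
        a[j] ~ dnorm(a0, prec_a)   # hierarchical prior on region intercepts
    }
    a0 ~ dnorm(0.0, 1.0/1.0e6)
    prec_a ~ dgamma(1/2.0, 1.0*10.0/2.0)
    tau = sqrt(1.0 / prec_a)       # sd of the region intercepts

    b[1] ~ dnorm(0.0, 1.0/1.0e6)
    b[2] ~ dnorm(0.0, 1.0/1.0e6)
    prec ~ dgamma(5/2.0, 5*10.0/2.0)
    sig = sqrt(1.0 / prec)         # observation-level sd
}
```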
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
[36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 2 3 4
0 31 20 24 18
1 3 2 3 0
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 101
Unobserved stochastic nodes: 9
Total graph size: 639
Initializing model
Potential scale reduction factors:
Point est. Upper C.I.
a[1] 1 1.01
a[2] 1 1.01
a[3] 1 1.01
a[4] 1 1.01
a0 1 1.00
b[1] 1 1.01
b[2] 1 1.00
sig 1 1.00
tau 1 1.00
Multivariate psrf
1
a[1] a[2] a[3] a[4] a0 b[1]
Lag 0 1.0000000 1.0000000 1.0000000 1.0000000 1.00000000 1.0000000
Lag 1 0.9129166 0.9173458 0.9125335 0.9313257 0.26347530 0.9784253
Lag 5 0.8354717 0.8419625 0.8390327 0.8547619 0.24745907 0.8966903
Lag 10 0.7448016 0.7507876 0.7480949 0.7600529 0.21619007 0.7985298
Lag 50 0.2961008 0.2981225 0.3033191 0.3046762 0.06845489 0.3158794
b[2] sig tau
Lag 0 1.000000000 1.0000000000 1.000000000
Lag 1 0.136754030 0.0603107165 0.258056648
Lag 5 0.026978792 0.0097383706 0.001325742
Lag 10 0.032334710 0.0194463099 0.019528626
Lag 50 0.007282582 0.0003911277 -0.010478866
a[1] a[2] a[3] a[4] a0 b[1]
185.1849 186.3214 172.3905 183.7944 846.4287 163.5538
b[2] sig tau
5881.4342 12831.2890 8977.0617
When communicating results from any analysis, a responsible statistician will report and justify modeling decisions, especially assumptions. In a Bayesian analysis, there is another assumption that is open to scrutiny: the choices of prior distributions. In the models considered so far in this course, there are an infinite number of prior distributions we could have chosen from.
How do you justify the priors you choose? If they truly represent your beliefs about the parameters before the analysis and the model is appropriate, then the posterior distribution truly represents your updated beliefs. If you don’t have any strong beliefs beforehand, there are often default, reference, or noninformative prior options, and you will have to select one. However, a collaborator or a boss (indeed, somebody somewhere) may not agree with your choice of prior. One way to increase the credibility of your results is to repeat the analysis under a variety of priors and report how the results differ. This process is called prior sensitivity analysis.
At a minimum you should always report your choice of model and prior. If you include a sensitivity analysis, select one or more alternative priors and describe how the results of the analysis change. If they are sensitive to the choice of prior, you will likely have to explain both sets of results, or at least explain why you favor one prior over another. If the results are not sensitive to the choice of prior, this is evidence that the data are strongly driving the results. It suggests that different investigators coming from different backgrounds should come to the same conclusions.
If the purpose of your analysis is to establish a hypothesis, it is often prudent to include a “skeptical” prior which does not favor the hypothesis. Then, if the posterior distribution still favors the hypothesis despite the unfavorable prior, you will be able to say that the data substantially favor the hypothesis. This is the approach we will take in the following example, continued from the previous lesson.
Let’s return to the example of number of doctor visits (Poisson regression). We concluded from our previous analysis of these data that both bad health and increased age are associated with more visits. Suppose the burden of proof that bad health is actually associated with more visits rests with us, and we need to convince a skeptic.
First, let’s re-run the original analysis and remind ourselves of the posterior distribution for the badh (bad health) indicator.
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 1127
Unobserved stochastic nodes: 4
Total graph size: 3673
Initializing model
Essentially all of the posterior probability mass is above 0, suggesting that this coefficient is positive (and consequently that bad health is associated with more visits). We obtained this result using a relatively noninformative prior. What if we use a prior that strongly favors values near 0? Let’s repeat the analysis with a normal prior on the badh coefficient that has mean 0 and standard deviation 0.2, so that the prior probability that the coefficient is less than 0.6 is greater than 0.998. We’ll also use a small variance on the prior for the interaction term involving badh (standard deviation 0.01, because this coefficient is on a much smaller scale).
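In the JAGS model, this change amounts to tightening those two priors (recall that JAGS parameterizes the normal by its precision, 1/variance; the rest of the model is unchanged):

```
b_badh ~ dnorm(0.0, 1.0/0.2^2)   # sd 0.2: strongly favors values near 0
b_intx ~ dnorm(0.0, 1.0/0.01^2)  # sd 0.01: interaction on a smaller scale
```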
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 1127
Unobserved stochastic nodes: 4
Total graph size: 3679
Initializing model
How did the posterior distribution for the coefficient of badh change?
Under the skeptical prior, our posterior distribution for b_badh has significantly dropped to between about 0.6 and 1.1. Although the strong prior influenced our inference on the magnitude of the bad health effect on visits, it did not change the fact that the coefficient is significantly above 0. In other words: even under the skeptical prior, bad health is associated with more visits, with posterior probability near 1.
We should also check the effect of our skeptical prior on the interaction term involving both age and health.
[1] 0.9486
The result here is interesting. Our estimate for the interaction coefficient has gone from negative under the noninformative prior to positive under the skeptical prior, so the result is sensitive. In this case, because the skeptical prior shrinks away much of the bad health main effect, it is likely that this interaction effect attempts to restore some of the positive effect of bad health on visits. Thus, despite some observed prior sensitivity, our conclusion that bad health positively associates with more visits remains unchanged.